
CUDA_ERROR_MISALIGNED_ADDRESS on MNIST example #2810

Closed
johnfrombluff opened this issue Jun 11, 2016 · 21 comments

@johnfrombluff

johnfrombluff commented Jun 11, 2016

Summary

What might be causing this error when running python tensorflow/models/image/mnist/convolutional.py?

E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS

Environment info

Operating System:
Linux Lounge 4.5.6-200.fc23.x86_64 #1 SMP Wed Jun 1 21:28:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
ls -l /usr/local/cuda-7.5/lib64/libcud*
-rw-r--r--. 1 root root 322936 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudadevrt.a
lrwxrwxrwx. 1 root root 16 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx. 1 root root 19 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x. 1 root root 383336 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart.so.7.5.18
-rw-r--r--. 1 root root 720192 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart_static.a
-rwxr-xr-x. 1 root root 61453024 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so
-rwxr-xr-x. 1 root root 61453024 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.4
-rwxr-xr-x. 1 root root 61453024 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.4.0.7
-rwxr-xr-x. 1 root root 59909104 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.5
-rwxr-xr-x. 1 root root 59909104 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.5.0.5
-rw-r--r--. 1 root root 62025862 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn_static.a

If installed from binary pip package, provide:

1. Which pip package you installed.

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl
pip install --upgrade $TF_BINARY_URL

2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally

If installed from sources, provide the commit hash:

Steps to reproduce

1. Run python tensorflow/models/image/mnist/convolutional.py
2. Observe the error CUDA_ERROR_MISALIGNED_ADDRESS
3. Scratch head

What have you tried?

  1. Searching the internet for clues; none found

Logs or other output that would be helpful

(If logs are large, please upload as attachment).
Results of cuda-memcheck and dmesg
error.txt

@zheng-xq
Contributor

Swapping my action with Benoit. All the misaligned memory reads came from Eigen kernels in the attached error messages.

@johnfrombluff
Author

Thanks for your help, but I'm still lost. I'm a user, not a programmer. I'm a statistician. So I'm not sure what to do to fix this problem. Does "All the misaligned memory reads came from Eigen kernels" mean I've done something wrong? If so, what have I done wrong? If not, then what can I do to get tensorflow working on my machine?

@bryonglodencissp
Contributor

@johnfrombluff Greetings. I don't know why the misaligned memory reads came from Eigen kernels.

Do you have a basic block size such as 256, 512, or 1024?

@johnfrombluff
Author

Sorry, I'm still confused. What block size are you referring to? Which file(s) should I look at to find what you're talking about?

I'm trying to run example code that comes with the tensorflow distribution. Shouldn't that code run on all supported architectures? Maybe GNU/Linux or my GPU is not supported, but I haven't seen that mentioned in the documentation.

And thank you for your attempt to help me!

@zheng-xq
Contributor

@johnfrombluff, my earlier comment was only meant to point out to my colleague which part of the program is triggering the error. It didn't imply you did something wrong.

Your GPU is a GTX 750 Ti, which is gm107. It is supported in theory, but it is a low-end GPU, so you might not see a very big speedup.

Since we have never seen this problem before and were unable to reproduce it, the only way to root-cause it is to ask you to run experiments. However, some of the steps are not easy for users who are not familiar with GPU programming.

Alternatively, you can try a different GPU. Both the Titan X and the GTX 1080 are very popular choices. If you still see the same problem with the latest Cuda driver, Cuda SDK, and a more recent GPU, we would definitely like to investigate.

@tsitsilin

@zheng-xq, I have a similar setup on Ubuntu 14.04.4 LTS: Cuda v7.5, Cudnn v4, /gpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl with a GTX 750 Ti, and I see the same issue on the mnist example:
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS.

I went one step further and tried other TF examples, such as alexnet, imagenet, and cifar10_multi_gpu_train. They seem to run OK (see the attached log), so the problem appears to be in the code of convolutional.py.

CUDA_ERROR_MISALIGNED_ADDRESS.txt

@acowlikeobject

@zheng-xq I see the same error when running the MNIST test.
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS

Also on Ubuntu 14.04, Cuda v7.5, Cudnn v4, using nvidia-docker with this image.

This is on a GTX 960M (I use it for sanity checks before spinning up servers).

I'm calling this via the Keras MNIST example. The same example works fine with the Theano backend (switched via the Keras configuration), as sketched below.
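For anyone who wants to reproduce the comparison, a minimal sketch of the backend switch (mnist_cnn.py is just a placeholder for whichever Keras example you run):

# Run the example once against the Theano backend instead of TensorFlow.
# KERAS_BACKEND overrides the "backend" field in ~/.keras/keras.json for this run only.
KERAS_BACKEND=theano python mnist_cnn.py
# To switch permanently, edit ~/.keras/keras.json and set "backend": "theano".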

cuda-memcheck.txt
environment.txt

@kalleknast

I get a similar error from running tf.nn.softmax().
I can give more details if this seems relevant.

To reproduce:

import tensorflow as tf

logits = tf.random_normal((10, 2))
y = tf.nn.softmax(logits)

with tf.Session() as sess:
    print(y.eval())

Result:

I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Quadro K2200
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:03:00.0
Total memory: 3.99GiB
Free memory: 3.20GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K2200, pci bus id: 0000:03:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_MISALIGNED_ADDRESS :: No stack trace available
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
Aborted (core dumped)

@prb12 prb12 added bug labels Jun 20, 2016
@dzupin

dzupin commented Jun 28, 2016

Same error when trying to run the basic MNIST example.
Linux Y15MATE 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

OUTPUT:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce 840M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:06:00.0
Total memory: 2.00GiB
Free memory: 1.84GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 840M, pci bus id: 0000:06:00.0)
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1

Process finished with exit code 134

@floringogianu

floringogianu commented Jun 29, 2016

I tried running the mnist example after installing TensorFlow in a virtualenv and got the same error. Ubuntu 16, gcc 5.3.1, python 3.5.1, Driver Version: 361.42, cuda 7.5, this time with a GTX 960 with 4GiB, which should be more than enough for this network model:

python -m tensorflow.models.image.mnist.convolutional

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX 960M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.33GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1
[1]    25066 abort (core dumped)  python -m tensorflow.models.image.mnist.convolutional

edit: Running the cifar10 model seems to work just fine...

@MartianWearables

I've run into exactly the same problem as described by floringogianu, except with Ubuntu 16.04 and gcc 4.9. Also, I used the --override flag when installing the CUDA toolkit via the .run script, which may or may not be relevant. cifar10 runs fine.

@MartianWearables

I've been able to circumvent the issue by following this setup, which includes switching to Python 3.x and the associated binary.

I also noted that the default build does not target NVIDIA compute capability 5.0 (that of the GTX 960M), but it works anyway. I attempted to build from source myself but ran into some linking errors which I don't have time to address at the moment. Others have had trouble building for Ubuntu 16.04, as documented in this thread, with some success. I'd be interested to know if anyone else succeeds at this and whether that issue gets closed.

@Waffleboy

Same problem here. I downgraded to gcc 4.9 to use Theano, and now TensorFlow is broken with CUDA_ERROR_MISALIGNED_ADDRESS (bottleneck generation is much faster, though).

@zheng-xq
Contributor

zheng-xq commented Jul 7, 2016

@johnfrombluff, @tsitsilin, @acowlikeobject, @kalleknast, @dzupin, @floringogianu, @MartianWearables, sorry that we cannot reproduce this problem on our side. I will try to guess where the problem is and see whether it could be fixed.

What is common among the folks who encountered this problem is that all used gm107- or gm108-based GPUs, i.e. compute capability 5.0. The TensorFlow binary by default carries compute capabilities 3.5 and 5.2. The Cuda driver will extract the compute 3.5 PTX and JIT-compile it into compute 5.0 SASS on the first run. Given that the error message is "Invalid local read of size 16", my current guess is that the JIT compiler in the Cuda driver is generating wrong code for tf.nn.softmax on GPUs with compute capability 5.0.
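If you are not sure which compute capability your card has, one quick check is the deviceQuery sample that ships with the Cuda toolkit (the paths below assume a default Cuda 7.5 install; adjust as needed):

# Copy the samples somewhere writable, build deviceQuery, and print the compute capability.
cp -r /usr/local/cuda-7.5/samples ~/cuda-samples
cd ~/cuda-samples/1_Utilities/deviceQuery
make
./deviceQuery | grep "CUDA Capability"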

Here are a number of things to try:

  1. Enable compute capability 5.0 directly when building from source. It is part of the "configure" step. This would generate SASS 5.0 with the static Cuda compiler and bypass the JIT compiler in the Cuda driver (see the sketch below).
  2. Install the latest driver from NVIDIA.

If #1 still fails, we can dump the SASS code from your binary and see what goes wrong.
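A rough sketch of option 1, for reference; the branch name, paths, and bazel invocation below are typical for a 0.9-era source build and may need adjusting to your setup:

# Build a pip package from source with SASS for compute capability 5.0.
git clone -b r0.9 https://github.com/tensorflow/tensorflow.git
cd tensorflow
# When "configure" asks for compute capabilities, answer 5.0; setting
# TF_CUDA_COMPUTE_CAPABILITIES beforehand pre-seeds that answer.
TF_CUDA_COMPUTE_CAPABILITIES=5.0 ./configure
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl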

@zheng-xq
Contributor

From an offline conversation, we can confirm that this problem goes away:

  1. Build from source while explicitly setting 5.0 build target.
  2. Or install the latest graphics driver 367.27.

So it does seem like a JIT compiler issue that goes away with the latest driver.

@MartinThoma

MartinThoma commented Jul 15, 2016

To expand on @zheng-xq's fix:

edit: Updating the driver seems not to be that easy (see ask.SE question). @zheng-xq Could you please add some details on how to build tensorflow with the compute capability set explicitly to 5.0? Is it possible to build tensorflow when CUDA was installed via apt-get (and thus there is no single cuda folder)?

@zheng-xq
Contributor

@MartinThoma, you can set the compute version to 5.0 via "configure".

Please specify a list of comma-separated Cuda compute capabilities you want to
build with. You can find the compute capability of your device at:
https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your
build time and binary size. [Default is: "3.5,5.2"]: 3.5

The same thing applies to Cuda and Cudnn paths.

Please specify the location where CUDA 7.5 toolkit is installed. Refer to
README.md for more details. [default is: /usr/local/cuda]: /usr/local/cuda

What does the structure of the Cuda binaries look like when you do apt-get? You can download and install directly from NVIDIA. If that is not possible and the file layout is similar, you can pass that directory to "configure". If the layout is completely different for some reason, I guess you can build another directory of symlinks that mimics the downloaded Cuda directory, roughly as sketched below.

https://developer.nvidia.com/cuda-downloads
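Purely a sketch of the symlink idea; the apt-get paths below are guesses and need to be adjusted to wherever your distribution actually put the files:

# Build a fake CUDA root whose layout matches what "configure" expects.
mkdir -p ~/cuda-fake/bin ~/cuda-fake/lib64 ~/cuda-fake/include
ln -s /usr/bin/nvcc ~/cuda-fake/bin/nvcc
ln -s /usr/lib/x86_64-linux-gnu/libcudart.so ~/cuda-fake/lib64/libcudart.so
ln -s /usr/lib/x86_64-linux-gnu/libcudadevrt.a ~/cuda-fake/lib64/libcudadevrt.a
ln -s /usr/include/cudnn.h ~/cuda-fake/include/cudnn.h
# ...repeat for the remaining libcuda*/libcudnn* files and headers, then point
# "configure" at ~/cuda-fake when it asks for the CUDA toolkit location.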

Hope that helps.

@MartinThoma

What does the structure of the Cuda binaries look like when you do apt-get?

Here are some of the files:

/usr/bin/nvcc
/usr/bin/nvidia-smi
/usr/lib/x86_64-linux-gnu/libcudadevrt.a
/usr/include/cudnn.h

When TensorFlow's configure asked me where the cuda folder is, I pointed it to /usr/lib/x86_64-linux-gnu/. But then it complained that there is no /usr/lib/x86_64-linux-gnu/lib64 (or something similar).

I think I'll just install it manually, because I run into those problems quite regularly.

@zheng-xq
Contributor

If those are not symlinks, I would recommend manually reinstalling it.

@ra9hur

ra9hur commented Jul 20, 2016

I was facing the same problem, getting "CUDA_ERROR_MISALIGNED_ADDRESS" with the MNIST samples. Below are the environment versions.
Ubuntu - 16.04
Driver - 361.45 or 364.19
CUDA - 7.5
CUDNN - 4.0
TF - 0.9
gcc - 4.9

I downgraded TensorFlow to 0.8 and this error no longer shows up. However, I started facing a new problem: TF would just hang, nvidia-smi showed the temperature at 68 C, and the process stopped responding. It was probably a driver issue. I installed all the latest versions (except TF) and it's all fine now.
Ubuntu - 16.04
Driver - 367.35
CUDA - 8.0 RC
CUDNN - 5.0
TF - 0.8
gcc - 5.41 (default that comes with Ubuntu 16.04 install)

@aselle aselle added type:bug Bug and removed bug labels Feb 9, 2017
@zhanglistar

I have the same error:
2017-09-07 11:20:05.454046: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-09-07 11:20:05.454119: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
2017-09-07 11:20:05.454124: E tensorflow/stream_executor/cuda/cuda_blas.cc:365] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-09-07 11:20:05.454183: W tensorflow/stream_executor/stream.cc:1601] attempting to perform BLAS operation using StreamExecutor without BLAS support
tools/jumbo/bin/python4.8: line 11: 32638 Aborted /opt/compiler/gcc-4.8.2/lib/ld-linux-x86-64.so.2 --library-path $SCRIPTPATH/../lib:/opt/compiler/gcc-4.8.2/lib:$LD_LIBRARY_PATH $SCRIPTPATH/python "$@"

Env:
GPU: Tesla P40
Driver: 375.20
cudnn: 5.1.10
gcc: 4.8.2
os: centos 4.3
