CUDA_ERROR_MISALIGNED_ADDRESS on MNIST example #2810
Swapping my action with Benoit. All the misaligned memory reads came from Eigen kernels in the attached error messages.
Thanks for your help, but I'm still lost. I'm a user, not a programmer. I'm a statistician. So I'm not sure what to do to fix this problem. Does "All the misaligned memory reads came from Eigen kernels" mean I've done something wrong? If so, what have I done wrong? If not, then what can I do to get tensorflow working on my machine?
@johnfrombluff Greetings. I don't know where the misaligned memory reads from the Eigen kernels come from. Do you have a basic block size, such as 256, 512, or 1024?
Sorry, I'm still confused. What block size are you referring to? Which file(s) should I look at to find what you're talking about? I'm trying to run example code that comes with the tensorflow distribution. Shouldn't that code run on all supported architectures? Maybe GNU/Linux or my GPU is not supported, but I haven't noticed that in the documentation? And thank you for your attempt to help me!
@johnfrombluff, my earlier comment was only meant to point out to my colleague which part of the program is triggering the error. It didn't imply you did something wrong. Your GPU is a GTX 750 Ti, which is gm107. It is supported in theory, but it is a low-end GPU, so you might not see a very big speedup. Since we have never seen this problem before and were unable to reproduce it, the only way to root-cause it is to ask you to run experiments. However, some of the steps are not the easiest for users who are not familiar with GPU programming. Alternatively, you can try a different GPU; both the Titan X and the GTX 1080 are very popular choices. If you still see the same problem with the latest Cuda driver, Cuda SDK and a more recent GPU, we would definitely like to investigate.
@zheng-xq, I have a similar setup on Ubuntu 14.04.4 LTS: Cuda v7.5, Cudnn v4. I went one step further and tried other tf examples, such as alexnet, imagenet, and cifar10_multi_gpu_train. They seem to run OK (see attached log), and the problem is in the code of
@zheng-xq I see the same error when running the MNIST test. Also on Ubuntu 14.04, Cuda v7.5, Cudnn v4, using nvidia-docker with this image. This is on a GTX 960M (I use it for sanity checks before spinning up servers). I'm calling it via the Keras MNIST example; the same example works fine with the Theano backend (via the Keras configuration).
I get a similar error from running
To reproduce:
Result:
Same error trying to run the basic MNIST example. OUTPUT: Process finished with exit code 134
I tried running the mnist example after I installed TensorFlow in virtualenv and I got the same error, Ubuntu 16, gcc 5.3.1, python 3.5.1, Driver Version: 361.42, cuda 7.5, this time with a GTX960 with 4GiB, which should be more than enough for this network model:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 960M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.33GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1
[1] 25066 abort (core dumped) python -m tensorflow.models.image.mnist.convolutional

edit: Running
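For context on what this error class means (not specific to this bug): a "misaligned address" fault is raised when the GPU executes a load or store whose address is not a multiple of the access width. A 16-byte vectorized read, for instance, must land on a 16-byte boundary. A minimal sketch of the alignment rule:

```python
def is_aligned(addr: int, access_size: int) -> bool:
    """A memory access of `access_size` bytes is aligned
    iff its address is a multiple of that size."""
    return addr % access_size == 0

# A 16-byte vector load at 0x7f00 is aligned; one at 0x7f08 is not,
# and hardware that enforces alignment (as the GPU does here) faults on it.
print(is_aligned(0x7F00, 16))  # True
print(is_aligned(0x7F08, 16))  # False
```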
I've run into exactly the same problem as described by floringogianu, except with Ubuntu 16.04 and gcc 4.9. Also, I used the --override flag when installing the cuda toolkit via the .run script, which may or may not be relevant. cifar10 runs fine.
I've been able to circumvent the issue by following this setup, which includes switching to python 3.x and the associated binary. I also noted that the default build does not target nvidia compute capability 5.0 (that of the GTX 960M), but it works anyway. I attempted to build from source myself but ran into some linking errors which I don't have time to address at the moment. Others have had trouble building for Ubuntu 16.04, as documented in this thread, with some success. I'd be interested in knowing whether anyone else succeeds at this, so that the issue can be closed.
Same problem here. Downgraded to gcc 4.9 to use Theano; now tensorflow is broken with CUDA_MISALIGNED_ADDRESS. (Bottleneck generation is much faster, though.)
@johnfrombluff, @tsitsilin, @acowlikeobject, @kalleknast, @dzupin, @floringogianu, @MartianWearables, sorry that we cannot reproduce this problem on our side. I will try to guess where the problem is and see whether it can be fixed.

What is common among the folks who encountered this problem is that all used gm107- or gm108-based GPUs, i.e. compute capability 5.0. The TensorFlow binary by default carries compute capabilities 3.5 and 5.2. On the first run, the Cuda driver extracts the compute 3.5 PTX and JIT-compiles it into compute 5.0 SASS.

Given that the error message is "Invalid local read of size 16", my current guess is that the JIT compiler in the Cuda driver is generating wrong code for tf.nn.softmax on GPUs with compute capability 5.0. Here are a number of things to try:
If #1 still fails, we can dump the SASS code from your binary and see what goes wrong. |
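Inspecting the embedded device code can be done with the CUDA toolkit's `cuobjdump` utility. The shared-object path below is an illustrative guess for a pip/virtualenv install of this era; locate the actual `_pywrap_tensorflow.so` on your machine first:

```shell
# Locate the TensorFlow native library (the path is an assumption --
# adjust to wherever pip actually installed it on your machine).
TF_SO=$(python -c "import os, tensorflow; print(os.path.dirname(tensorflow.__file__))")/python/_pywrap_tensorflow.so

# List the embedded ELF cubins: shows which compute capabilities
# (e.g. sm_35, sm_52) have precompiled SASS in the binary.
cuobjdump --list-elf "$TF_SO"

# Dump the PTX that the driver JIT-compiles for other capabilities
# (such as compute 5.0 on a GTX 750 Ti / 960M).
cuobjdump --dump-ptx "$TF_SO" | head -n 40
```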
From an offline conversation, we can confirm that this problem goes away:
So it does seem like a JIT compiler issue that goes away with the latest driver.
To expand on @zheng-xq's fix:
edit: Updating the driver seems not to be that easy (see ask.SE question). @zheng-xq Could you please add some details on how to build tensorflow with the compute capability set explicitly to 5.0? Is it possible to build tensorflow when one installed CUDA via
@MartinThoma, you can set the compute capability to 5.0 via "configure".
The same thing applies to Cuda and Cudnn paths.
What does the structure of the Cuda binaries look like when you install via apt-get? You can download and install directly from NVIDIA. If that is not possible but the directory layout is similar, you can pass that directory to "configure". If the directory layout is completely different for some reason, I guess you can build another directory of symlinks that mimics the downloaded Cuda directory. https://developer.nvidia.com/cuda-downloads Hope that helps.
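The symlink workaround for an apt-get layout could be sketched like this. All source paths here are assumptions for a Debian/Ubuntu-style split install; point them at wherever your distribution actually put the files, then give the mimic directory to "configure" as the Cuda toolkit path:

```shell
# Build a directory tree that mimics the layout of a CUDA toolkit
# installed from NVIDIA's own packages. Source paths are assumptions --
# adjust them to your system.
CUDA_MIMIC="$HOME/cuda-mimic"
mkdir -p "$CUDA_MIMIC/lib64" "$CUDA_MIMIC/include" "$CUDA_MIMIC/bin"

# Libraries: apt-get typically scatters them under /usr/lib/x86_64-linux-gnu.
for lib in /usr/lib/x86_64-linux-gnu/libcud*; do
  [ -e "$lib" ] && ln -sf "$lib" "$CUDA_MIMIC/lib64/"
done

# Headers and nvcc (these symlinks may dangle if a path differs on your
# system; fix the source paths rather than the links).
ln -sf /usr/include/cuda.h "$CUDA_MIMIC/include/cuda.h"
ln -sf /usr/bin/nvcc "$CUDA_MIMIC/bin/nvcc"

echo "Pass $CUDA_MIMIC as the Cuda toolkit path to ./configure"
```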
Here are some of the files:
I think I'll just install it manually, because I run into these problems quite regularly.
If those are not symlinks, I would recommend manually reinstalling it.
I was facing the same problem: "CUDA_ERROR_MISALIGNED_ADDRESS" with the MNIST samples. Below are the environment versions. After downgrading tensorflow to 0.8 this error no longer showed up; however, I started facing a new problem: TF would just hang, nvidia-smi showed the temperature at 68 C, and the process stopped responding. It was probably a driver issue. I installed all the latest versions (except TF) and it's all fine now.
I have the same error. Env:
Summary
What might be causing this error when running python tensorflow/models/image/mnist/convolutional.py?
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
Environment info
Operating System:
Linux Lounge 4.5.6-200.fc23.x86_64 #1 SMP Wed Jun 1 21:28:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
ls -l /usr/local/cuda-7.5/lib64/libcud*
-rw-r--r--. 1 root root 322936 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudadevrt.a
lrwxrwxrwx. 1 root root 16 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx. 1 root root 19 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x. 1 root root 383336 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart.so.7.5.18
-rw-r--r--. 1 root root 720192 Aug 16 2015 /usr/local/cuda-7.5/lib64/libcudart_static.a
-rwxr-xr-x. 1 root root 61453024 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so
-rwxr-xr-x. 1 root root 61453024 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.4
-rwxr-xr-x. 1 root root 61453024 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.4.0.7
-rwxr-xr-x. 1 root root 59909104 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.5
-rwxr-xr-x. 1 root root 59909104 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn.so.5.0.5
-rw-r--r--. 1 root root 62025862 Jun 11 12:35 /usr/local/cuda-7.5/lib64/libcudnn_static.a
If installed from binary pip package, provide:
1. Which pip package you installed.
2. The output from python -c "import tensorflow; print(tensorflow.__version__)":
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
If installed from sources, provide the commit hash:
Steps to reproduce
1. python tensorflow/models/image/mnist/convolutional.py
2. Observe error CUDA_ERROR_MISALIGNED_ADDRESS
3. Scratch head
What have you tried?
Logs or other output that would be helpful
(If logs are large, please upload as attachment).
Results of cuda-memcheck and dmesg
error.txt