
failed call to cuInit: CUDA_ERROR_UNKNOWN in python programs using Ubuntu bumblebee #394

Closed
jpmerc opened this issue Dec 2, 2015 · 59 comments

@jpmerc commented Dec 2, 2015

I have a Quadro K1100M integrated GPU with compute capability 3.0. I had to install Bumblebee to make CUDA work. After configuring the build with TF_UNOFFICIAL_SETTING=1 ./configure, I can now run the tutorials_example_trainer with the command sudo optirun bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu. However, I am not able to run the examples in Python directly.

For example, if I run convolutional.py in tensorflow/models/image/mnist with the command optirun python convolutional.py, I get the following error:

tensorflow/tensorflow/models/image/mnist$ optirun python convolutional.py 
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
E tensorflow/stream_executor/cuda/cuda_driver.cc:466] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:98] retrieving CUDA diagnostic information for host: jp-pc
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:106] hostname: jp-pc
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:131] libcuda reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:242] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:135] kernel reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:211] kernel version seems to match DSO: 352.63
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8

It seems my GPU is not recognized in Python programs because of its 3.0 compute capability. Is there a way to avoid this problem?
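As a side note on reading the log above: the cuda_diagnostics lines are cross-checking the kernel module version against the version libcuda reports, since a mismatch there is one classic cause of cuInit failures. A minimal sketch of that comparison (the parsing below is my own, not TensorFlow's actual code):

```python
import re

def kernel_module_version(nvrm_text):
    """Extract the driver version from /proc/driver/nvidia/version-style text."""
    m = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", nvrm_text)
    return m.group(1) if m else None

def versions_match(nvrm_text, libcuda_version):
    """True when the kernel module and the user-space libcuda agree."""
    return kernel_module_version(nvrm_text) == libcuda_version

nvrm = ("NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  "
        "Sat Nov  7 21:25:42 PST 2015")
print(versions_match(nvrm, "352.63"))  # True: versions agree, so the fault is elsewhere
```

Here the log shows 352.63 on both sides ("kernel version seems to match DSO"), so the failure is not a driver/library version mismatch.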

@vrv vrv changed the title failed call to cuInit: CUDA_ERROR_UNKNOWN in python programs failed call to cuInit: CUDA_ERROR_UNKNOWN in python programs using Ubuntu bumblebee Dec 2, 2015
@vrv commented Dec 2, 2015

Updated subject to reflect the environment you're trying to run in. Hopefully someone in the community who knows more about bumblebee/optimus laptops might be able to help!

@jpmerc (Author) commented Dec 2, 2015

Just to be clear, I am able to run any program built with Bazel. However, in plain Python programs that import the tensorflow library, the GPU will not work and I am stuck using only the CPUs.

@girving (Contributor) commented Dec 7, 2015

@vrv: Assigning you since it doesn't let me assign zheng-xq. Do you know why?

@vrv commented Dec 7, 2015

@girving: as discussed offline, fixing that.

Yeah, CUDA_ERROR_UNKNOWN is not very helpful. Hopefully @zheng-xq knows more about what's going on here.

@girving girving added the cuda label Dec 7, 2015
@zheng-xq (Contributor) commented:

@jpmerc, could you run your command line through sudo, similar to your C++ examples? I wonder whether it is the root access that is making the difference. The initialization logic should be the same between the C++ and Python clients.

@jpmerc (Author) commented Dec 11, 2015

It doesn't seem to find the CUDA library when run under sudo:

$ sudo optirun python convolutional.py 

Traceback (most recent call last):
  File "convolutional.py", line 30, in <module>
    import tensorflow.python.platform
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python.client.client_lib import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/client_lib.py", line 54, in <module>
    from tensorflow.python.client.session import InteractiveSession
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 31, in <module>
    from tensorflow.python import pywrap_tensorflow as tf_session
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: libcudart.so.7.0: cannot open shared object file: No such file or directory

@zheng-xq (Contributor) commented:

Interesting. Could you add the path to your CUDA 7.0 runtime to LD_LIBRARY_PATH?

@jpmerc (Author) commented Dec 11, 2015

It is already in LD_LIBRARY_PATH.

In LD_LIBRARY_PATH I have :
/usr/local/cuda-7.0/lib64
/usr/local/cuda-7.0/lib

The library the program is looking for is there :
/usr/local/cuda-7.0/lib/libcudart.so.7.0
/usr/local/cuda-7.0/lib64/libcudart.so.7.0
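Plain sudo typically resets LD_LIBRARY_PATH (the env_reset/secure_path handling in sudoers strips it), which would explain the ImportError even though the files exist on disk. A small sketch that mimics the dynamic loader's directory walk, to show what the child process actually sees (the helper name is mine):

```python
import os

def find_shared_lib(name, search_path):
    """Return the first directory on a colon-separated path containing `name`."""
    for d in search_path.split(":"):
        candidate = os.path.join(d, name)
        if d and os.path.exists(candidate):
            return candidate
    return None

# Under plain `sudo`, LD_LIBRARY_PATH is usually empty in the child process,
# so this lookup returns None even though libcudart.so.7.0 exists on disk.
print(find_shared_lib("libcudart.so.7.0", os.environ.get("LD_LIBRARY_PATH", "")))
```

Setting the variable on the sudo command line itself, as suggested later in the thread, sidesteps the reset.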

@PeterBeukelman commented:

Did someone find an answer to this?
When running word2vec_basic.py I get

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
E tensorflow/stream_executor/cuda/cuda_driver.cc:466] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:98] retrieving CUDA diagnostic information for host: peter-linux
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:106] hostname: peter-linux
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:131] libcuda reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:242] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015
GCC version: gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:135] kernel reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:211] kernel version seems to match DSO: 352.63
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA:
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8

It runs, but probably without the GPU. At the end it complains:
Please install sklearn and matplotlib to visualize embeddings.
Unfortunately, sklearn and matplotlib are installed. That said, matplotlib fails its self-test (numpy version 1.10.2).

@martinwicke (Member) commented:

Could you run nvidia-debugdump -l or nvidia-smi and paste the output? I had a similar problem and in the end it was a lack of power for the graphics card.

@PeterBeukelman commented:

Found 1 NVIDIA devices
Device ID: 0
Device name: GeForce GTX TITAN X (*PrimaryCard)
GPU internal ID: 0420115018258

Tue Dec 15 23:56:17 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:04:00.0 On | N/A |
| 22% 33C P8 17W / 250W | 441MiB / 12287MiB | 1% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1270 G /usr/bin/X 174MiB |
| 0 2183 G compiz 112MiB |
| 0 2575 G ...ves-passed-by-fd --v8-snapshot-passed-by- 127MiB |
+-----------------------------------------------------------------------------+

@martinwicke (Member) commented:

Well, it was worth a shot.


@zheng-xq (Contributor) commented:

@jpmerc, could you try to set LD_LIBRARY_PATH inside your sudo? That should make sudo preserve the environment variables.

sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64 optirun python convolutional.py

@PeterBeukelman, to make sure you have the same problem, could you run the C++ tutorial?

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

@PeterBeukelman commented:

I think the basic MNIST example worked with the GPU early on, but I ran into a problem with the word2vec visualization. While searching for a fix, I noticed two different driver versions being mentioned, 346 and 352. This made me think I had mistakenly updated the driver, so I tried to revert to 346; I purged everything, but after reinstalling I still ended up with 352.
I also had the wrong CUDA (7.5) installed initially. I have made a link from /usr/local/cuda-7.0 to /usr/local/cuda.
My LD_LIBRARY_PATH:
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib/python2.7/dist-packages/tensorflow/python/:
For the Bazel build, I had this problem:

INFO: From Compiling tensorflow/core/kernels/cwise_op_gpu_cos.cu.cc:
In file included from third_party/gpus/cuda/include/cuda_runtime.h:62:0,
                 from :0:
third_party/gpus/cuda/include/host_config.h:105:2: error: #error -- unsupported GNU version! gcc 4.10 and up are not supported!
 #error -- unsupported GNU version! gcc 4.10 and up are not supported!
 ^
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_cos.cu.o' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_cos.cu.d' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: not all outputs were created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/constant_op_gpu.cu.o' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/constant_op_gpu.cu.d' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: not all outputs were created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_ceil.cu.o' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_ceil.cu.d' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: not all outputs were created.

I had this before too, and dug up an old version of gcc (3.4). If I use that version:

gcc: unrecognized option `-no-canonical-prefixes'
gcc: .: linker input file unused because linking not done
gcc: bazel-out/local_linux-opt/genfiles: linker input file unused because linking not done
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-fstack-protector"
cc1plus: error: unrecognized command line option "-Wno-free-nonheap-object"
cc1plus: error: unrecognized command line option "-Wno-builtin-macro-redefined"
cc1plus: error: unrecognized command line option "-std=c++11"
cc1plus: .: No such file or directory
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:115:1: C++ compilation of rule '//tensorflow/core:direct_session' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 59 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
Target //tensorflow/cc:tutorials_example_trainer failed to build

I recall getting past that before, though; it ended with copying the binary.
I also had a problem where something was looking for c++, and I had to make a link to gcc.

@PeterBeukelman commented:

I changed from bash to csh and tried again to build with Bazel using gcc 3.4.6.
This is what I see with --verbose_failures:
[tensorflow] % sudo bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer --verbose_failures
INFO: Found 1 target...
INFO: From Compiling google/protobuf/src/google/protobuf/any.pb.cc [for host]:
gcc: bazel-out/host/genfiles: No such file or directory
gcc: unrecognized option `-no-canonical-prefixes'
gcc: .: linker input file unused because linking not done
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-fstack-protector"
cc1plus: error: unrecognized command line option "-Wno-free-nonheap-object"
cc1plus: error: unrecognized command line option "-Wno-error=unused-function"
cc1plus: error: unrecognized command line option "-Wno-builtin-macro-redefined"
cc1plus: error: unrecognized command line option "-std=c++11"
cc1plus: .: No such file or directory
ERROR: /home/peter/tensorflow/lib/python2.7/site-packages/tensorflow/google/protobuf/BUILD:63:1: C++ compilation of rule '//google/protobuf:protobuf' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /root/.cache/bazel/bazel_root/0d235aa1ad00a592d6c87ed1f69bc69b/tensorflow &&
exec env -
INTERCEPT_LOCALLY_EXECUTABLE=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' -iquote . -iquote bazel-out/host/genfiles -isystem google/protobuf/src -isystem bazel-out/host/genfiles/google/protobuf/src -isystem tools/cpp/gcc3 -DHAVE_PTHREAD -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare '-Wno-error=unused-function' -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o' -MD -MF bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.d -c google/protobuf/src/google/protobuf/any.pb.cc -o bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1: crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /root/.cache/bazel/bazel_root/0d235aa1ad00a592d6c87ed1f69bc69b/tensorflow &&
exec env -
INTERCEPT_LOCALLY_EXECUTABLE=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' -iquote . -iquote bazel-out/host/genfiles -isystem google/protobuf/src -isystem bazel-out/host/genfiles/google/protobuf/src -isystem tools/cpp/gcc3 -DHAVE_PTHREAD -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare '-Wno-error=unused-function' -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o' -MD -MF bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.d -c google/protobuf/src/google/protobuf/any.pb.cc -o bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
Target //tensorflow/cc:tutorials_example_trainer failed to build
INFO: Elapsed time: 0.579s, Critical Path: 0.16s

@jpmerc (Author) commented Dec 17, 2015

@zheng-xq It seems to work now, but it does not use my configuration (I configured for compute capability 3.0, yet it requires 3.5).

tensorflow/tensorflow/models/image/mnist$ sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64 optirun python convolutional.py
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:903] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: 
name: Quadro K1100M
major: 3 minor: 0 memoryClockRate (GHz) 0.7055
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.97GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:669] Ignoring gpu device (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 8
Initialized!
Epoch 0.00
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
...
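The "Ignoring gpu device ... compute capability 3.0" line in the log above is TensorFlow's build-time minimum-capability gate, which TF_UNOFFICIAL_SETTING=1 ./configure lets you lower. The check itself amounts to a tuple comparison, roughly (this is a sketch of the behavior, not TensorFlow's actual code):

```python
def device_supported(major, minor, min_capability=(3, 5)):
    """Gate mirroring the log message: devices below the minimum are ignored."""
    return (major, minor) >= min_capability

print(device_supported(3, 0))                         # False: the K1100M is skipped
print(device_supported(3, 0, min_capability=(3, 0)))  # True once the build accepts 3.0
```

Since the threshold is baked in at build time, the unofficial configure setting only takes effect for the Python client once the pip package is rebuilt and reinstalled.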

@PeterBeukelman commented:

That is an old CUDA version. Am I correct in thinking that my issues with Bazel are independent of the issue with importing tensorflow in Python that results in
"failed call to cuInit: CUDA_ERROR_UNKNOWN"?

@zheng-xq (Contributor) commented:

@jpmerc, could you confirm that you ran TF_UNOFFICIAL_SETTING=1 ./configure before starting the build?

@PeterBeukelman, I think they are most likely separate issues. Note that Bazel is best supported on Ubuntu 14.04 at the moment. The default gcc with that is 4.8.

@Cvikli commented Jan 25, 2016

I had the same problem as @jpmerc, with an Nvidia GTX 960M. The problem was connected with this: https://devtalk.nvidia.com/default/topic/907350/installing-cuda-7-0-but-get-cuda-7-5-/
I just reinstalled 7.0 and everything worked fine.

@girving (Contributor) commented Mar 8, 2016

@martinwicke, @zheng-xq: Is this obsolete now that we support 7.5?

@recurse-id commented:

I had the same problem, this fixed it: sudo apt-get install nvidia-modprobe

@martinwicke (Member) commented:

It should be fixed. I'll close this for now -- we can reopen if it's still a problem.

@ajwimmers commented Jun 11, 2016

It's worth adding that
sudo apt-get install nvidia-modprobe
fixed it for me too, even though I had already installed it in a previous session directly before installing TensorFlow.

@liusiye commented Jun 28, 2016

sudo apt-get install nvidia-modprobe, this is magic

@hyy1111 commented Jul 30, 2016

sudo apt-get install nvidia-modprobe, this fixed it for me too

@juanprietob commented:

I ran into this issue recently. I upgraded my NVIDIA driver to version 375.26 and Docker to version 1.13.0. When training a network I would get:

cuInit: CUDA_ERROR_UNKNOWN

The problem here is that CUDA fails to initiate the shared GPU context. For some reason, the nvidia-cuda-mps-control service is not active after the upgrade; I need to investigate more.

However, try running nvidia-cuda-mps-server on the host machine. This solved it for me.

@WeitaoVan commented:

I had the same issue.
Simply rebooting the computer fixed the problem for me :)
Suggestion: do not suspend your computer (which caused the problem in my case).

@leekyungmoon commented:

Is it necessary to reboot after running sudo apt-get install nvidia-modprobe?

@jamesdanged commented:

sudo apt-get install nvidia-modprobe works for me, with a restart.

@pyk commented May 14, 2017

@leekyungmoon A reboot alone worked for me, without installing nvidia-modprobe, as @WeitaoVan said.

@dreamsuifeng commented:

Running under sudo fixed the same issue for me.

@liutongxuan (Contributor) commented:

nvidia-cuda-mps-server works for me

@poppingtonic commented:

On Ubuntu 17.04, nvidia-cuda-mps-server doesn't work; it doesn't even output anything when I run the command. I've installed the driver with sudo apt-get install nvidia-384 and CUDA with sudo apt-get install nvidia-cuda-toolkit, and a simple test from this link compiles successfully, printing the resulting array. I can run optirun nvidia-smi, which shows that an X server is running (probably due to the virtual display).

nvidia-modprobe in 17.04 is still linked to nvidia-375, so it doesn't work.

Multiple reboots later, I still get this issue.

Acer Predator Helios 300, GTX 1060

@brayan07 commented Oct 6, 2017

Are you still having trouble with this issue?

This is what worked for me. I had previously tried an installation of CUDA using a .run file. The installation had configured the nvidia-384 driver and this was precisely what I saw when I ran nvidia-smi. I ran /usr/bin/nvidia-uninstall and the CUDA_ERROR_UNKNOWN went away. Further, when I run nvidia-smi I see the expected driver version (375).

In summary, make sure that previous installations of drivers/CUDA are not the source of the error if the suggestions in this thread don't work. This is likely the case if one installation was done with a .run file while another was done via a .deb package.
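The mixed-install situation described above (a .run-file driver left behind alongside a .deb-packaged one) can be checked for mechanically. A rough sketch; the marker paths below are conventional hints and my own assumption, not anything this thread verified:

```python
import os
import shutil

def installed_via_run_file(root="/"):
    """The .run installer leaves /usr/bin/nvidia-uninstall behind;
    its presence hints at a driver the package manager doesn't know about."""
    return os.path.exists(os.path.join(root, "usr/bin/nvidia-uninstall"))

def dpkg_available():
    """On Debian/Ubuntu, package-managed drivers can then be listed
    with `dpkg -l 'nvidia-*'` to compare against the .run install."""
    return shutil.which("dpkg") is not None

# Both markers present at once suggests the conflict described above,
# worth cleaning up by running /usr/bin/nvidia-uninstall first.
print(installed_via_run_file(), dpkg_available())
```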

@inoryy commented Oct 26, 2017

Everything worked fine under the nvidia-384 drivers, then suddenly broke with CUDA_ERROR_UNKNOWN.
What worked for me is the opposite of the solutions above: removing nvidia-modprobe, probably because it's tied to nvidia-375, as the poster above mentioned.

@sheshap commented Nov 17, 2017

I am facing the CUDA_ERROR_UNKNOWN issue on a Windows Server 2012 R2 machine. Has anybody tried this on Windows Server? Please help.

@GiggleLiu commented:

@leekyungmoon After installing nvidia-modprobe, a reboot worked for me.

@prasad3130 commented:

In my case, nvidia-modprobe was installed and the paths were correct. What solved it was running the commands here: https://devtalk.nvidia.com/default/topic/760872/ubuntu-12-04-error-cudagetdevicecount-returned-30/

In particular, running the following:
$ sudo modinfo nvidia-<driver_version_num>-uvm (with driver_version_num as 384 in my case)
$ sudo modprobe --force-modversion nvidia-331-uvm

Hope this helps.

@tharuniitk commented:

Interestingly, none of these worked for me. Adding the following to .bashrc worked like a charm:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/usr/local/cuda-8.0/targets/x86_64-linux/lib/"

@DjangoPeng (Contributor) commented:

Same here @tharuniitk

None of these work.

@ffs97 commented Feb 4, 2018

@prasad3130 Thanks a lot, that worked like a charm. Although it is worth noting that the command should be the following:
sudo modprobe --force-modversion nvidia-<nvidia-version>-uvm

@gfphoenix commented Feb 8, 2018

If you work on Linux/Ubuntu, check your kernel with find /lib/modules/ | grep -i nvidia. Make sure the modules nvidia-modeset.ko, dkms/nvidia-uvm.ko, dkms/nvidia-drm.ko, and dkms/nvidia.ko exist, along with /proc/driver/nvidia/. My TensorFlow CUDA setup worked very well, but one day it suddenly threw this error; when I switched back to the old kernel, which has the above modules, it worked again. The error happens when Ubuntu silently switches to the newest kernel, which lacks the necessary modules.
Hope it helps.
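That kernel-module check can be mechanized. A sketch, assuming the module list above is what is required; names are normalized because dkms builds name the files like nvidia_384_uvm.ko:

```python
import re

REQUIRED = {"nvidia.ko", "nvidia-uvm.ko", "nvidia-modeset.ko", "nvidia-drm.ko"}

def canon(filename):
    """Normalize dkms names like nvidia_384_uvm.ko to nvidia-uvm.ko."""
    return re.sub(r"[-_]\d+", "", filename).replace("_", "-")

def missing_modules(found_paths, required=REQUIRED):
    """Return required module names absent from a `find`-style path listing."""
    present = {canon(p.rsplit("/", 1)[-1]) for p in found_paths}
    return sorted(required - present)

# Feed this the output of: find /lib/modules/$(uname -r) -name '*nvidia*'
paths = ["/lib/modules/4.4.0-21-generic/updates/dkms/nvidia_384_uvm.ko",
         "/lib/modules/4.4.0-21-generic/updates/dkms/nvidia_384.ko"]
print(missing_modules(paths))  # ['nvidia-drm.ko', 'nvidia-modeset.ko']
```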

@enesunal commented Mar 5, 2018

Try sudo ldconfig after installing CUDA and cuDNN.

@rk-roman commented:

sudo apt-get install nvidia-modprobe worked for me without a restart on 16.04.

@rohitbhio commented:

nvidia-cuda-mps-server solved the problem for me after upgrading tensorflow to 1.7.0

@piotrkochan commented:

On Ubuntu 18.04, the nvidia-modprobe package doesn't exist:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package nvidia-modprobe

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

@alancneves commented Sep 11, 2018

I've had a problem running the darknet detector on Ubuntu in a multiuser environment. To solve the issue, I exported the CUDA_CACHE_PATH variable as export CUDA_CACHE_PATH=/tmp/nvidia
before using the GPU.
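A per-user variant of that workaround, so multiple users don't trample one shared /tmp/nvidia directory (the helper name is mine):

```python
import getpass
import os
import tempfile

def per_user_cuda_cache():
    """Point CUDA's JIT/compute cache at a per-user directory."""
    path = os.path.join(tempfile.gettempdir(), "nvidia-%s" % getpass.getuser())
    os.makedirs(path, exist_ok=True)
    os.environ["CUDA_CACHE_PATH"] = path  # must be set before the GPU is first touched
    return path

print(per_user_cuda_cache())
```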

@gzhcv commented Oct 4, 2018

System information

  • OS Platform and Distribution : Ubuntu 14.04

  • TensorFlow version : 1.10 gpu

  • Python version: 2.7

  • CUDA/cuDNN version: 9.0 / 7

  • GPU model and memory: Nvidia GeForce GTX TITAN X

  • nvidia-smi:

+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 Off | N/A |
| 22% 61C P0 75W / 250W | 0MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:03:00.0 Off | N/A |
| 23% 63C P0 78W / 250W | 0MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:83:00.0 Off | N/A |
| 22% 60C P0 77W / 250W | 0MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TIT... Off | 0000:84:00.0 Off | N/A |
| 22% 61C P0 72W / 250W | 0MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+-----------------

  • find /lib/modules/ | grep -i nvidia

/lib/modules/4.2.0-27-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.2.0-27-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/4.2.0-27-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.2.0-27-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384_drm.ko

Describe the problem

I upgraded the NVIDIA driver with the command sudo apt-get install nvidia-384. Then I found there were several NVIDIA drivers installed (via sudo dpkg --list | grep nvidia-*), so I uninstalled all of them except nvidia-384 with sudo apt-get remove xxx. After that, the info is as follows:

ii nvidia-384 384.130-0ubuntu0.14.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-opencl-icd-384 384.130-0ubuntu0.14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2.1 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 352.39-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver

The error occurred when I ran TensorFlow code, as follows:

2018-10-03 23:13:51.656015: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-10-03 23:13:51.656131: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: root0-SCW4350-16
2018-10-03 23:13:51.656166: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: root0-SCW4350-16
2018-10-03 23:13:51.656299: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 384.130.0
2018-10-03 23:13:51.656428: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 384.130.0
2018-10-03 23:13:51.656465: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 384.130.0
<tensorflow.python.client.session.Session object at 0x7f30a08ae5d0>

I have tried many solutions, such as installing nvidia-modprobe, running nvidia-cuda-mps-server, and export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64:/usr/local/cuda-9.0/targets/x86_64-linux/lib/". None of these work.

Maybe it is noteworthy that nvidia-smi shows NVIDIA-SMI 352.63 Driver Version: 384.130, while sudo dpkg --list | grep nvidia-* shows nvidia-settings 352.39-0ubuntu1. It seems that some modules of nvidia-352 were not uninstalled. I also tried to install the NVIDIA driver through an nvidia_xxxx.run file, but the error remains while running TensorFlow code.

Hopefully you can help me with this issue.

@nhatuan84 commented:

In my case, the NVIDIA graphics driver accidentally switched to another version that is not suitable. Just reinstalling the suitable version made it work again.

@mariomeissner commented:

What's the equivalent solution on Windows?

