
failed call to cuInit: CUDA_ERROR_UNKNOWN in python programs using Ubuntu bumblebee #394

Closed
jpmerc opened this issue Dec 2, 2015 · 59 comments

@jpmerc commented Dec 2, 2015

I have a Quadro K1100M integrated GPU with compute capability 3.0. I had to install Bumblebee to make CUDA work. After configuring the build with TF_UNOFFICIAL_SETTING=1 ./configure, I can now run the tutorials_example_trainer with the command sudo optirun bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu. However, I am not able to run the examples in Python directly.

For example, if I run convolutional.py in tensorflow/models/image/mnist with the command optirun python convolutional.py, I get the following error:

tensorflow/tensorflow/models/image/mnist$ optirun python convolutional.py 
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
E tensorflow/stream_executor/cuda/cuda_driver.cc:466] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:98] retrieving CUDA diagnostic information for host: jp-pc
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:106] hostname: jp-pc
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:131] libcuda reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:242] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:135] kernel reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:211] kernel version seems to match DSO: 352.63
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8

It seems my GPU is not recognized in Python programs because of its 3.0 compute capability. Is there a way to avoid this problem?
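As a side note on reading the log above: the cuda_diagnostics lines are cross-checking the kernel module version against the version libcuda reports, since a mismatch there is one classic cause of cuInit failures. A minimal sketch of that comparison (the parsing below is my own, not TensorFlow's actual code):

```python
import re

def kernel_module_version(nvrm_text):
    """Extract the driver version from /proc/driver/nvidia/version-style text."""
    m = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", nvrm_text)
    return m.group(1) if m else None

def versions_match(nvrm_text, libcuda_version):
    """True when the kernel module and the user-space libcuda agree."""
    return kernel_module_version(nvrm_text) == libcuda_version

nvrm = ("NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  "
        "Sat Nov  7 21:25:42 PST 2015")
print(versions_match(nvrm, "352.63"))  # True: versions agree, so the fault is elsewhere
```

Here the log shows 352.63 on both sides ("kernel version seems to match DSO"), so the failure is not a driver/library version mismatch.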

@vrv vrv changed the title failed call to cuInit: CUDA_ERROR_UNKNOWN in python programs failed call to cuInit: CUDA_ERROR_UNKNOWN in python programs using Ubuntu bumblebee Dec 2, 2015
@vrv commented Dec 2, 2015

Updated subject to reflect the environment you're trying to run in. Hopefully someone in the community who knows more about bumblebee/optimus laptops might be able to help!

@jpmerc (Author) commented Dec 2, 2015

Just to be clear, I am able to run any program built with Bazel. However, in plain Python programs that import the tensorflow library, the GPU will not work and I am stuck using only the CPUs.

@girving (Contributor) commented Dec 7, 2015

@vrv: Assigning you since it doesn't let me assign zheng-xq. Do you know why?

@vrv commented Dec 7, 2015

@girving: as discussed offline, fixing that.

Yeah, CUDA_ERROR_UNKNOWN is not very helpful. Hopefully @zheng-xq knows more about what's going on here.

@girving girving added the cuda label Dec 7, 2015
@zheng-xq (Contributor) commented:

@jpmerc, could you run your command line through sudo, similar to your C++ examples? I wonder whether it is the root access that is making the difference. The initialization logic should be the same between the C++ and Python clients.

@jpmerc (Author) commented Dec 11, 2015

It doesn't seem to find the CUDA library when run under sudo:

$ sudo optirun python convolutional.py 

Traceback (most recent call last):
  File "convolutional.py", line 30, in <module>
    import tensorflow.python.platform
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python.client.client_lib import *
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/client_lib.py", line 54, in <module>
    from tensorflow.python.client.session import InteractiveSession
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 31, in <module>
    from tensorflow.python import pywrap_tensorflow as tf_session
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: libcudart.so.7.0: cannot open shared object file: No such file or directory

@zheng-xq (Contributor) commented:

Interesting. Could you add the path to your CUDA 7.0 runtime to LD_LIBRARY_PATH?

@jpmerc (Author) commented Dec 11, 2015

It is already in LD_LIBRARY_PATH.

In LD_LIBRARY_PATH I have :
/usr/local/cuda-7.0/lib64
/usr/local/cuda-7.0/lib

The library the program is looking for is there :
/usr/local/cuda-7.0/lib/libcudart.so.7.0
/usr/local/cuda-7.0/lib64/libcudart.so.7.0
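Plain sudo typically resets LD_LIBRARY_PATH (the env_reset/secure_path handling in sudoers strips it), which would explain the ImportError even though the files exist on disk. A small sketch that mimics the dynamic loader's directory walk, to show what the child process actually sees (the helper name is mine):

```python
import os

def find_shared_lib(name, search_path):
    """Return the first directory on a colon-separated path containing `name`."""
    for d in search_path.split(":"):
        candidate = os.path.join(d, name)
        if d and os.path.exists(candidate):
            return candidate
    return None

# Under plain `sudo`, LD_LIBRARY_PATH is usually empty in the child process,
# so this lookup returns None even though libcudart.so.7.0 exists on disk.
print(find_shared_lib("libcudart.so.7.0", os.environ.get("LD_LIBRARY_PATH", "")))
```

Setting the variable on the sudo command line itself, as suggested later in the thread, sidesteps the reset.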

@PeterBeukelman commented:

Did someone find an answer to this?
When running word2vec_basic.py I get

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
E tensorflow/stream_executor/cuda/cuda_driver.cc:466] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:98] retrieving CUDA diagnostic information for host: peter-linux
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:106] hostname: peter-linux
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:131] libcuda reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:242] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015
GCC version: gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:135] kernel reported version is: 352.63
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:211] kernel version seems to match DSO: 352.63
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA:
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8

It runs, but probably without the GPU. At the end it complains:
Please install sklearn and matplotlib to visualize embeddings.
Unfortunately, sklearn and matplotlib are installed. That said, matplotlib fails its self-test (numpy version 1.10.2).

@martinwicke (Member) commented:

Could you run nvidia-debugdump -l or nvidia-smi and paste the output? I had a similar problem and in the end it was a lack of power for the graphics card.

@PeterBeukelman commented:

Found 1 NVIDIA devices
Device ID: 0
Device name: GeForce GTX TITAN X (*PrimaryCard)
GPU internal ID: 0420115018258

Tue Dec 15 23:56:17 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:04:00.0 On | N/A |
| 22% 33C P8 17W / 250W | 441MiB / 12287MiB | 1% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1270 G /usr/bin/X 174MiB |
| 0 2183 G compiz 112MiB |
| 0 2575 G ...ves-passed-by-fd --v8-snapshot-passed-by- 127MiB |
+-----------------------------------------------------------------------------+

@martinwicke (Member) commented:

Well, it was worth a shot.


@zheng-xq (Contributor) commented:

@jpmerc, could you try to set LD_LIBRARY_PATH inside your sudo? That should make sudo preserve the environment variables.

sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64 optirun python convolutional.py

@PeterBeukelman, to make sure you have the same problem, could you run the C++ tutorial?

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

@PeterBeukelman commented:

I think the basic MNIST example worked with the GPU early on, but I ran into a problem with the word2vec visualization. While searching for a fix, I noticed two different driver versions being mentioned, 346 and 352. This made me think I had mistakenly updated the driver, so I tried to revert to 346; I purged everything, but after reinstalling I still ended up with 352.
I also had the wrong CUDA (7.5) installed initially. I have made a link from /usr/local/cuda-7.0 to /usr/local/cuda.
My LD_LIBRARY_PATH:
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib/python2.7/dist-packages/tensorflow/python/:
For the Bazel build, I had this problem:

INFO: From Compiling tensorflow/core/kernels/cwise_op_gpu_cos.cu.cc:
In file included from third_party/gpus/cuda/include/cuda_runtime.h:62:0,
                 from :0:
third_party/gpus/cuda/include/host_config.h:105:2: error: #error -- unsupported GNU version! gcc 4.10 and up are not supported!
 #error -- unsupported GNU version! gcc 4.10 and up are not supported!
 ^
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_cos.cu.o' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_cos.cu.d' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: not all outputs were created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/constant_op_gpu.cu.o' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/constant_op_gpu.cu.d' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: not all outputs were created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_ceil.cu.o' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: output 'tensorflow/core/_objs/gpu_kernels/tensorflow/core/kernels/cwise_op_gpu_ceil.cu.d' was not created.
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:248:1: not all outputs were created.

I had this before too, and dug up an old version of gcc (3.4). If I use that version:

gcc: unrecognized option `-no-canonical-prefixes'
gcc: .: linker input file unused because linking not done
gcc: bazel-out/local_linux-opt/genfiles: linker input file unused because linking not done
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-fstack-protector"
cc1plus: error: unrecognized command line option "-Wno-free-nonheap-object"
cc1plus: error: unrecognized command line option "-Wno-builtin-macro-redefined"
cc1plus: error: unrecognized command line option "-std=c++11"
cc1plus: .: No such file or directory
ERROR: /home/peter/tensorflow_sources/tensorflow/core/BUILD:115:1: C++ compilation of rule '//tensorflow/core:direct_session' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 59 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
Target //tensorflow/cc:tutorials_example_trainer failed to build

I recall getting past that before, though; it ended with copying the binary.
I also had a problem where something was looking for c++, and I had to make a link to gcc.

@PeterBeukelman commented:

I changed from bash to csh and tried again to build with Bazel using gcc 3.4.6.
This is what I see with --verbose_failures:
[tensorflow] % sudo bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer --verbose_failures
INFO: Found 1 target...
INFO: From Compiling google/protobuf/src/google/protobuf/any.pb.cc [for host]:
gcc: bazel-out/host/genfiles: No such file or directory
gcc: unrecognized option `-no-canonical-prefixes'
gcc: .: linker input file unused because linking not done
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-iquote"
cc1plus: error: unrecognized command line option "-fstack-protector"
cc1plus: error: unrecognized command line option "-Wno-free-nonheap-object"
cc1plus: error: unrecognized command line option "-Wno-error=unused-function"
cc1plus: error: unrecognized command line option "-Wno-builtin-macro-redefined"
cc1plus: error: unrecognized command line option "-std=c++11"
cc1plus: .: No such file or directory
ERROR: /home/peter/tensorflow/lib/python2.7/site-packages/tensorflow/google/protobuf/BUILD:63:1: C++ compilation of rule '//google/protobuf:protobuf' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /root/.cache/bazel/bazel_root/0d235aa1ad00a592d6c87ed1f69bc69b/tensorflow &&
exec env -
INTERCEPT_LOCALLY_EXECUTABLE=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' -iquote . -iquote bazel-out/host/genfiles -isystem google/protobuf/src -isystem bazel-out/host/genfiles/google/protobuf/src -isystem tools/cpp/gcc3 -DHAVE_PTHREAD -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare '-Wno-error=unused-function' -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o' -MD -MF bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.d -c google/protobuf/src/google/protobuf/any.pb.cc -o bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1: crosstool_wrapper_driver_is_not_gcc failed: error executing command
(cd /root/.cache/bazel/bazel_root/0d235aa1ad00a592d6c87ed1f69bc69b/tensorflow &&
exec env -
INTERCEPT_LOCALLY_EXECUTABLE=1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -g0 '-std=c++11' -iquote . -iquote bazel-out/host/genfiles -isystem google/protobuf/src -isystem bazel-out/host/genfiles/google/protobuf/src -isystem tools/cpp/gcc3 -DHAVE_PTHREAD -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare '-Wno-error=unused-function' -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-frandom-seed=bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o' -MD -MF bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.d -c google/protobuf/src/google/protobuf/any.pb.cc -o bazel-out/host/bin/google/protobuf/_objs/protobuf/google/protobuf/src/google/protobuf/any.pb.o): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
Target //tensorflow/cc:tutorials_example_trainer failed to build
INFO: Elapsed time: 0.579s, Critical Path: 0.16s

@jpmerc (Author) commented Dec 17, 2015

@zheng-xq It seems to work now, but it does not use my configuration (I configured for compute capability 3.0, yet it requires 3.5).

tensorflow/tensorflow/models/image/mnist$ sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64 optirun python convolutional.py
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:903] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: 
name: Quadro K1100M
major: 3 minor: 0 memoryClockRate (GHz) 0.7055
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.97GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:669] Ignoring gpu device (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 8
Initialized!
Epoch 0.00
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
...
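The "Ignoring gpu device ... compute capability 3.0" line in the log above is TensorFlow's build-time minimum-capability gate, which TF_UNOFFICIAL_SETTING=1 ./configure lets you lower. The check itself amounts to a tuple comparison, roughly (this is a sketch of the behavior, not TensorFlow's actual code):

```python
def device_supported(major, minor, min_capability=(3, 5)):
    """Gate mirroring the log message: devices below the minimum are ignored."""
    return (major, minor) >= min_capability

print(device_supported(3, 0))                         # False: the K1100M is skipped
print(device_supported(3, 0, min_capability=(3, 0)))  # True once the build accepts 3.0
```

Since the threshold is baked in at build time, the unofficial configure setting only takes effect for the Python client once the pip package is rebuilt and reinstalled.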

@PeterBeukelman commented:

That is an old CUDA version. Am I correct in thinking that my issues with Bazel are independent of the issue with importing tensorflow in Python that results in
"failed call to cuInit: CUDA_ERROR_UNKNOWN"?

@zheng-xq (Contributor) commented:

@jpmerc, could you confirm that you ran TF_UNOFFICIAL_SETTING=1 ./configure before starting the build?

@PeterBeukelman, I think they are most likely separate issues. Note that Bazel is best supported on Ubuntu 14.04 at the moment. The default gcc with that is 4.8.

@Cvikli commented Jan 25, 2016

I had the same problem as @jpmerc, with an Nvidia GTX 960M. The problem was connected with this: https://devtalk.nvidia.com/default/topic/907350/installing-cuda-7-0-but-get-cuda-7-5-/
I just reinstalled 7.0 and everything worked fine.

@girving (Contributor) commented Mar 8, 2016

@martinwicke, @zheng-xq: Is this obsolete now that we support 7.5?

@recurse-id commented:

I had the same problem, this fixed it: sudo apt-get install nvidia-modprobe

@martinwicke (Member) commented:

It should be fixed. I'll close this for now -- we can reopen if it's still a problem.

@ajwimmers commented Jun 11, 2016

It's worth adding that
sudo apt-get install nvidia-modprobe
fixed it for me too, even though I had already installed it in a previous session directly before installing TensorFlow.

@liusiye commented Jun 28, 2016

sudo apt-get install nvidia-modprobe, this is magic

@hyy1111 commented Jul 30, 2016

sudo apt-get install nvidia-modprobe, this fixed it for me too

@juanprietob commented:

I ran into this issue recently. I upgraded my NVIDIA driver to version 375.26 and Docker to version 1.13.0. When training a network I would get:

cuInit: CUDA_ERROR_UNKNOWN

The problem here is that CUDA fails to initiate the shared GPU context. For some reason, the nvidia-cuda-mps-control service is not active after the upgrade; I need to investigate more.

However, try running nvidia-cuda-mps-server on the host machine. This solved it for me.

@WeitaoVan commented:

I had the same issue.
Simply rebooting the computer fixed the problem for me :)
Suggestion: do not suspend your computer (which caused the problem in my case).

@leekyungmoon commented:

Is it necessary to reboot after running sudo apt-get install nvidia-modprobe?

@jamesdanged commented:

sudo apt-get install nvidia-modprobe works for me, with a restart.

@pyk commented May 14, 2017

@leekyungmoon A reboot alone worked for me, without installing nvidia-modprobe, as @WeitaoVan said.

@dreamsuifeng commented:

Running under sudo fixed the same issue for me.

@liutongxuan (Contributor) commented:

nvidia-cuda-mps-server works for me

@poppingtonic commented:

On Ubuntu 17.04, nvidia-cuda-mps-server doesn't work; it doesn't even output anything when I run the command. I've installed the driver with sudo apt-get install nvidia-384 and CUDA with sudo apt-get install nvidia-cuda-toolkit, and a simple test from this link compiles successfully, printing the resulting array. I can run optirun nvidia-smi, which shows that an X server is running (probably due to the virtual display).

nvidia-modprobe in 17.04 is still linked to nvidia-375, so it doesn't work.

Multiple reboots later, I still get this issue.

Acer Predator Helios 300, GTX 1060

@brayan07 commented Oct 6, 2017

Are you still having trouble with this issue?

This is what worked for me. I had previously tried an installation of CUDA using a .run file. The installation had configured the nvidia-384 driver and this was precisely what I saw when I ran nvidia-smi. I ran /usr/bin/nvidia-uninstall and the CUDA_ERROR_UNKNOWN went away. Further, when I run nvidia-smi I see the expected driver version (375).

In summary, make sure that previous installations of drivers/CUDA are not the source of the error if the suggestions in this thread don't work. This is likely the case if one installation was done with a .run file while another was done via a .deb package.
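The mixed-install situation described above (a .run-file driver left behind alongside a .deb-packaged one) can be checked for mechanically. A rough sketch; the marker paths below are conventional hints and my own assumption, not anything this thread verified:

```python
import os
import shutil

def installed_via_run_file(root="/"):
    """The .run installer leaves /usr/bin/nvidia-uninstall behind;
    its presence hints at a driver the package manager doesn't know about."""
    return os.path.exists(os.path.join(root, "usr/bin/nvidia-uninstall"))

def dpkg_available():
    """On Debian/Ubuntu, package-managed drivers can then be listed
    with `dpkg -l 'nvidia-*'` to compare against the .run install."""
    return shutil.which("dpkg") is not None

# Both markers present at once suggests the conflict described above,
# worth cleaning up by running /usr/bin/nvidia-uninstall first.
print(installed_via_run_file(), dpkg_available())
```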

@inoryy commented Oct 26, 2017

Everything worked fine under the nvidia-384 drivers, then suddenly broke with CUDA_ERROR_UNKNOWN.
What worked for me is the opposite of the solutions above: removing nvidia-modprobe, probably because it's tied to nvidia-375, as the poster above mentioned.

@sheshap commented Nov 17, 2017

I am facing the CUDA_ERROR_UNKNOWN issue on a Windows Server 2012 R2 machine. Has anybody tried this on Windows Server? Please help.

@GiggleLiu commented:

@leekyungmoon After installing nvidia-modprobe, a reboot worked for me.

@prasad3130 commented:

In my case, nvidia-modprobe was installed and the paths were correct. What solved it was running the commands here: https://devtalk.nvidia.com/default/topic/760872/ubuntu-12-04-error-cudagetdevicecount-returned-30/

In particular, running the following:
$ sudo modinfo nvidia-<driver_version_num>-uvm (with driver_version_num as 384 in my case)
$ sudo modprobe --force-modversion nvidia-331-uvm

Hope this helps.

@tharuniitk commented:

Interestingly, none of these worked for me. Adding the following to .bashrc worked like a charm:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/usr/local/cuda-8.0/targets/x86_64-linux/lib/"

@DjangoPeng (Contributor) commented:

Same here @tharuniitk

None of these work.

@ffs97 commented Feb 4, 2018

@prasad3130 Thanks a lot, that worked like a charm. Although it is worth noting that the command should be the following:
sudo modprobe --force-modversion nvidia-<nvidia-version>-uvm

@gfphoenix commented Feb 8, 2018

If you work on Linux/Ubuntu, check your kernel with find /lib/modules/ | grep -i nvidia. Make sure the modules nvidia-modeset.ko, dkms/nvidia-uvm.ko, dkms/nvidia-drm.ko, and dkms/nvidia.ko exist, along with /proc/driver/nvidia/. My TensorFlow CUDA setup worked very well, but one day it suddenly threw this error; when I switched back to the old kernel, which has the above modules, it worked again. The error happens when Ubuntu silently switches to the newest kernel, which lacks the necessary modules.
Hope it helps.
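That kernel-module check can be mechanized. A sketch, assuming the module list above is what is required; names are normalized because dkms builds name the files like nvidia_384_uvm.ko:

```python
import re

REQUIRED = {"nvidia.ko", "nvidia-uvm.ko", "nvidia-modeset.ko", "nvidia-drm.ko"}

def canon(filename):
    """Normalize dkms names like nvidia_384_uvm.ko to nvidia-uvm.ko."""
    return re.sub(r"[-_]\d+", "", filename).replace("_", "-")

def missing_modules(found_paths, required=REQUIRED):
    """Return required module names absent from a `find`-style path listing."""
    present = {canon(p.rsplit("/", 1)[-1]) for p in found_paths}
    return sorted(required - present)

# Feed this the output of: find /lib/modules/$(uname -r) -name '*nvidia*'
paths = ["/lib/modules/4.4.0-21-generic/updates/dkms/nvidia_384_uvm.ko",
         "/lib/modules/4.4.0-21-generic/updates/dkms/nvidia_384.ko"]
print(missing_modules(paths))  # ['nvidia-drm.ko', 'nvidia-modeset.ko']
```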

@enesunal commented Mar 5, 2018

Try sudo ldconfig after installing CUDA and cuDNN.

@rk-roman commented:

sudo apt-get install nvidia-modprobe worked for me without a restart on 16.04.

@rohitbhio commented:

nvidia-cuda-mps-server solved the problem for me after upgrading tensorflow to 1.7.0

@piotrkochan commented:

On Ubuntu 18.04, the nvidia-modprobe package doesn't exist:

Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package nvidia-modprobe

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

@alancneves commented Sep 11, 2018

I've had a problem running the darknet detector on Ubuntu in a multiuser environment. To solve the issue, I exported the CUDA_CACHE_PATH variable as export CUDA_CACHE_PATH=/tmp/nvidia
before using the GPU.
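A per-user variant of that workaround, so multiple users don't trample one shared /tmp/nvidia directory (the helper name is mine):

```python
import getpass
import os
import tempfile

def per_user_cuda_cache():
    """Point CUDA's JIT/compute cache at a per-user directory."""
    path = os.path.join(tempfile.gettempdir(), "nvidia-%s" % getpass.getuser())
    os.makedirs(path, exist_ok=True)
    os.environ["CUDA_CACHE_PATH"] = path  # must be set before the GPU is first touched
    return path

print(per_user_cuda_cache())
```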

@gzhcv commented Oct 4, 2018

System information

  • OS Platform and Distribution : Ubuntu 14.04

  • TensorFlow version : 1.10 gpu

  • Python version: 2.7

  • CUDA/cuDNN version: 9.0 / 7

  • GPU model and memory: Nvidia GeForce GTX TITAN X

  • nvidia-smi:

+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 Off | N/A |
| 22% 61C P0 75W / 250W | 0MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:03:00.0 Off | N/A |
| 23% 63C P0 78W / 250W | 0MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:83:00.0 Off | N/A |
| 22% 60C P0 77W / 250W | 0MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TIT... Off | 0000:84:00.0 Off | N/A |
| 22% 61C P0 72W / 250W | 0MiB / 12207MiB | 0% Default |
+-------------------------------+----------------------+-----------------

  • find /lib/modules/ | grep -i nvidia

/lib/modules/4.2.0-27-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.2.0-27-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/4.2.0-27-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.2.0-27-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.2.0-27-generic/updates/dkms/nvidia_384_drm.ko

Describe the problem

I upgraded the NVIDIA driver with the command sudo apt-get install nvidia-384. Then I found there were several NVIDIA drivers installed (via sudo dpkg --list | grep nvidia-*), so I uninstalled all of them except nvidia-384 with sudo apt-get remove xxx. After that, the info is as follows:

ii nvidia-384 384.130-0ubuntu0.14.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-opencl-icd-384 384.130-0ubuntu0.14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2.1 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 352.39-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver

The error occurred when I ran TensorFlow code, as follows:

2018-10-03 23:13:51.656015: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-10-03 23:13:51.656131: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: root0-SCW4350-16
2018-10-03 23:13:51.656166: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: root0-SCW4350-16
2018-10-03 23:13:51.656299: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 384.130.0
2018-10-03 23:13:51.656428: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 384.130.0
2018-10-03 23:13:51.656465: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 384.130.0
<tensorflow.python.client.session.Session object at 0x7f30a08ae5d0>

I have tried many solutions, such as installing nvidia-modprobe, running nvidia-cuda-mps-server, and export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64:/usr/local/cuda-9.0/targets/x86_64-linux/lib/". None of these work.

Maybe it is noteworthy that nvidia-smi shows NVIDIA-SMI 352.63 Driver Version: 384.130, while sudo dpkg --list | grep nvidia-* shows nvidia-settings 352.39-0ubuntu1. It seems that some modules of nvidia-352 were not uninstalled. I also tried to install the NVIDIA driver through an nvidia_xxxx.run file, but the error remains while running TensorFlow code.

Hopefully you can help me with this issue.

@nhatuan84 commented:

In my case, the NVIDIA graphics driver accidentally switched to another version that is not suitable. Just reinstalling the suitable version made it work again.

@mariomeissner commented:

What's the equivalent solution on Windows?

