nvidia-smi: No running processes found #8877

Closed
dbl001 opened this Issue Mar 31, 2017 · 8 comments

Comments

Projects
None yet
3 participants
@dbl001

dbl001 commented Mar 31, 2017

I'm running Ubuntu 14.04 LTS on AWS g2.2xlarge

$ python neural_gpu_trainer.py --problem=bmul

...

modprobe: ERROR: could not insert 'nvidia_375_uvm': Invalid argument
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_UNKNOWN

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

http://stackoverflow.com/questions/33970755/tensorflow-not-using-gpu

Environment info

Operating System:
Ubuntu 14.04 LTS

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
-rw-r--r-- 1 root root 179466 Mar 27 16:01 /usr/local/cuda/lib/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Mar 27 16:01 /usr/local/cuda/lib/libcudart.so -> libcudart.so.7.0
lrwxrwxrwx 1 root root 19 Mar 27 16:01 /usr/local/cuda/lib/libcudart.so.7.0 -> libcudart.so.7.0.28
-rwxr-xr-x 1 root root 303052 Mar 27 16:01 /usr/local/cuda/lib/libcudart.so.7.0.28
-rw-r--r-- 1 root root 546514 Mar 27 16:01 /usr/local/cuda/lib/libcudart_static.a

If installed from binary pip package, provide:
$ conda install -c jjh_cio_testing tensorflow-gpu

  1. A link to the pip package you installed:

  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
    ubuntu@ip-10-0-1-131:~/cuda/lib64$ python -c "import tensorflow; print(tensorflow.version)"
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
    1.0.1

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
  2. The output of bazel version

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

ubuntu@ip-10-0-1-131:~/models/neural_gpu$ python neural_gpu_trainer.py --problem=bmul
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
Generating data for bmul.
cut 1.20 lr 0.100 iw 0.80 cr 0.30 nm 64 d0.1000 gn 4.00 layers 2 kw 3 h 4 kh 3 batch 32 noise 0.00
Creating model.
Creating backward pass for the model.
WARNING:tensorflow:Tried to colocate gpu0/gradients/gpu0/Gather_2_grad/Shape with an op target_embedding/read that had a different device: /device:GPU:0 vs /device:CPU:0. Ignoring colocation property.
Created model for gpu 0 in 6.46 s.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
modprobe: ERROR: could not insert 'nvidia_375_uvm': Invalid argument
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-10-0-1-131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-10-0-1-131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 346.46.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  346.46  Tue Feb 17 17:56:08 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 346.46.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 346.46.0
Created model. Checkpoint dir /tmp/neural_gpu
Reading model parameters from /tmp/neural_gpu/neural_gpu.ckpt-500
step 600 step-time 1.15 train-size 0.051 lr 0.100000 grad-norm 1.3013 len 6 ppl 1.335808 errors 29.08 sequence-errors 23.34
  bin 0 (2)	bmul	ppl 1.08 errors 0.00 seq-errors 0.00
  bin 1 (3)	bmul	ppl 1.07 errors 0.00 seq-errors 0.00
  bin 2 (4)	bmul	ppl NA errors NA seq-errors NA
  bin 3 (5)	bmul	ppl 1.14 errors 2.90 seq-errors 5.47
  bin 4 (6)	bmul	ppl NA errors NA seq-errors NA
  bin 7 (9)	bmul	ppl 1.67 errors 23.42 seq-errors 61.72
  bin 10 (12)	bmul	ppl NA errors NA seq-errors NA
  bin 13 (15)	bmul	ppl 1.94 errors 36.26 seq-errors 96.88
  bin 16 (18)	bmul	ppl NA errors NA seq-errors NA
  bin 19 (21)	bmul	ppl 2.07 errors 41.34 seq-errors 100.00
  bin 22 (24)	bmul	ppl NA errors NA seq-errors NA
  bin 25 (27)	bmul	ppl 2.11 errors 44.47 seq-errors 100.00
  bin 28 (30)	bmul	ppl NA errors NA seq-errors NA
  bin 31 (33)	bmul	ppl 2.14 errors 44.22 seq-errors 100.00
  bin 34 (36)	bmul	ppl NA errors NA seq-errors NA
  bin 37 (39)	bmul	ppl 2.16 errors 46.27 seq-errors 100.00
  bin 40 (42)	bmul	ppl NA errors NA seq-errors NA
  bin 43 (45)	bmul	ppl 2.17 errors 45.92 seq-errors 100.00
  bin 46 (48)	bmul	ppl NA errors NA seq-errors NA
step 700 step-time 1.04 train-size 0.051 lr 0.100000 grad-norm 1.2822 len 7 ppl 1.305894 errors 24.70 sequence-errors 20.28
  bin 0 (2)	bmul	ppl 1.08 errors 0.00 seq-errors 0.00
  bin 1 (3)	bmul	ppl 1.07 errors 0.00 seq-errors 0.00
  bin 2 (4)	bmul	ppl NA errors NA seq-errors NA
  bin 3 (5)	bmul	ppl 1.09 errors 0.00 seq-errors 0.00
  bin 4 (6)	bmul	ppl NA errors NA seq-errors NA
  bin 7 (9)	bmul	ppl 1.65 errors 25.41 seq-errors 59.38
  bin 10 (12)	bmul	ppl NA errors NA seq-errors NA
  bin 13 (15)	bmul	ppl 1.94 errors 35.41 seq-errors 100.00
  bin 16 (18)	bmul	ppl NA errors NA seq-errors NA
  bin 19 (21)	bmul	ppl 2.08 errors 42.48 seq-errors 100.00
  bin 22 (24)	bmul	ppl NA errors NA seq-errors NA
  bin 25 (27)	bmul	ppl 2.14 errors 44.49 seq-errors 100.00
  bin 28 (30)	bmul	ppl NA errors NA seq-errors NA
  bin 31 (33)	bmul	ppl 2.16 errors 46.73 seq-errors 100.00
  bin 34 (36)	bmul	ppl NA errors NA seq-errors NA
  bin 37 (39)	bmul	ppl 2.20 errors 47.76 seq-errors 100.00
  bin 40 (42)	bmul	ppl NA errors NA seq-errors NA
  bin 43 (45)	bmul	ppl 2.22 errors 49.13 seq-errors 100.00
  bin 46 (48)	bmul	ppl NA errors NA seq-errors NA
step 800 step-time 1.64 train-size 0.051 lr 0.100000 grad-norm 1.3836 len 8 ppl 1.393354 errors 30.55 sequence-errors 30.53
  bin 0 (2)	bmul	ppl 1.08 errors 0.00 seq-errors 0.00
  bin 1 (3)	bmul	ppl 1.08 errors 0.00 seq-errors 0.00
  bin 2 (4)	bmul	ppl NA errors NA seq-errors NA
  bin 3 (5)	bmul	ppl 1.11 errors 0.00 seq-errors 0.00
  bin 4 (6)	bmul	ppl NA errors NA seq-errors NA
  bin 7 (9)	bmul	ppl 1.65 errors 23.67 seq-errors 63.28
  bin 10 (12)	bmul	ppl NA errors NA seq-errors NA
  bin 13 (15)	bmul	ppl 1.91 errors 37.00 seq-errors 97.66
  bin 16 (18)	bmul	ppl NA errors NA seq-errors NA
  bin 19 (21)	bmul	ppl 2.07 errors 43.81 seq-errors 100.00
  bin 22 (24)	bmul	ppl NA errors NA seq-errors NA
  bin 25 (27)	bmul	ppl 2.10 errors 44.66 seq-errors 100.00
  bin 28 (30)	bmul	ppl NA errors NA seq-errors NA
  bin 31 (33)	bmul	ppl 2.13 errors 45.06 seq-errors 100.00
  bin 34 (36)	bmul	ppl NA errors NA seq-errors NA
  bin 37 (39)	bmul	ppl 2.16 errors 45.51 seq-errors 100.00
  bin 40 (42)	bmul	ppl NA errors NA seq-errors NA
  bin 43 (45)	bmul	ppl 2.21 errors 47.88 seq-errors 100.00
  bin 46 (48)	bmul	ppl NA errors NA seq-errors NA

What other attempted solutions have you tried?

ubuntu@ip-10-0-1-131:~/cuda/lib64$ nvidia-smi
Fri Mar 31 17:22:46 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   37C    P0    35W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+--------------------------------------------
### Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

@dbl001

This comment has been minimized.

Show comment
Hide comment
@dbl001

dbl001 Mar 31, 2017

ubuntu@ip-10-0-1-131:~/cuda/lib64$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
ubuntu@ip-10-0-1-131:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

modprobe: ERROR: could not insert 'nvidia_375_uvm': Invalid argument
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

dbl001 commented Mar 31, 2017

ubuntu@ip-10-0-1-131:~/cuda/lib64$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
ubuntu@ip-10-0-1-131:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

modprobe: ERROR: could not insert 'nvidia_375_uvm': Invalid argument
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
@aselle

This comment has been minimized.

Show comment
Hide comment
@aselle

aselle Mar 31, 2017

Member

It seems you are using a mix of cuda 7 and cuda 7.5? Please try CUDA 8.

Member

aselle commented Mar 31, 2017

It seems you are using a mix of cuda 7 and cuda 7.5? Please try CUDA 8.

@dbl001

This comment has been minimized.

Show comment
Hide comment
@dbl001

dbl001 Mar 31, 2017

dbl001 commented Mar 31, 2017

@jjhelmus

This comment has been minimized.

Show comment
Hide comment
@jjhelmus

jjhelmus Mar 31, 2017

CUDA 7.5 requires driver version 352.xx or newer but you seem to be using an older version, can you update the driver?

CUDA 7.5 requires driver version 352.xx or newer but you seem to be using an older version, can you update the driver?

@dbl001

This comment has been minimized.

Show comment
Hide comment
@dbl001

dbl001 Mar 31, 2017

dbl001 commented Mar 31, 2017

@jjhelmus

This comment has been minimized.

Show comment
Hide comment
@jjhelmus

jjhelmus Mar 31, 2017

Ubuntu has a page on how to install the NVidia drivers, https://help.ubuntu.com/community/BinaryDriverHowto/Nvidia. You can check the version using:

cat /proc/driver/nvidia/version

Ubuntu has a page on how to install the NVidia drivers, https://help.ubuntu.com/community/BinaryDriverHowto/Nvidia. You can check the version using:

cat /proc/driver/nvidia/version
@dbl001

This comment has been minimized.

Show comment
Hide comment
@dbl001

dbl001 Mar 31, 2017

dbl001 commented Mar 31, 2017

@dbl001

This comment has been minimized.

Show comment
Hide comment
@dbl001

dbl001 Apr 1, 2017

I successfully installed the NVIDIA 367.57 driver and now 'nvidia-smi' is working:

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ nvidia-smi
Sat Apr  1 17:30:19 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   41C    P0    77W / 125W |   3800MiB /  4036MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     19821    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ 

dbl001 commented Apr 1, 2017

I successfully installed the NVIDIA 367.57 driver and now 'nvidia-smi' is working:

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ nvidia-smi
Sat Apr  1 17:30:19 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   41C    P0    77W / 125W |   3800MiB /  4036MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     19821    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ 

@dbl001 dbl001 closed this Apr 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment