
Error on calling nvidia-smi: Command 'ps ...' returned non-zero exit status 1 #16

Closed
feiwofeifeixiaowo opened this issue Aug 12, 2017 · 20 comments

feiwofeifeixiaowo commented Aug 12, 2017

I got the above error message when I run gpustat, but nvidia-smi works on my machine. Here are some details:
OS: Ubuntu 14.04.5 LTS
Python version: Anaconda Python 3.6

Error on calling nvidia-smi. Use --debug flag for details
Traceback (most recent call last):
  File "/usr/local/bin/gpustat", line 417, in print_gpustat                                                      gpu_stats = GPUStatCollection.new_query()
  File "/usr/local/bin/gpustat", line 245, in new_query
    return GPUStatCollection(gpu_list)
  File "/usr/local/bin/gpustat", line 218, in __init__
    self.update_process_information()
  File "/usr/local/bin/gpustat", line 316, in update_process_information
    processes = self.running_processes()
  File "/usr/local/bin/gpustat", line 275, in running_processes
    ','.join(map(str, pid_map.keys()))
  File "/usr/local/bin/gpustat", line 46, in execute_process
    stdout = check_output(command_shell, shell=True).strip()
  File "/home/xiyun/apps/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/xiyun/apps/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'ps -o pid,user:16,comm -p1 -p 14471' returned non-zero exit status 1.

How can I fix this?

wookayin (Owner) commented Aug 12, 2017

Hello. It seems that even though your version contains f66cffd, ps still fails. When the problem occurs, what is the output of ps -o pid,user:16,comm -p1 -p 14471 and of nvidia-smi?

feiwofeifeixiaowo (Author) commented Aug 12, 2017

Hey, thanks for your reply. Here is the output of the two commands.
ps returned an empty result:

ps -o pid,user:16,comm -p1 -p 14471
PID USER             COMMAND

nvidia-smi shows its normal state:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 22%   50C    P8    15W / 250W |    454MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 22%   59C    P8    18W / 250W |    111MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 27%   67C    P8    18W / 250W |    113MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 52%   82C    P2    97W / 250W |  11666MiB / 12206MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I still can't figure out what happened.

wookayin (Owner) commented Aug 12, 2017

Thanks for the information. I think #12 has a similar cause: nvidia-smi sometimes returns a non-existent process ID (in your case, 14471). Or your ps might be broken.

However, our recent patch didn't work: ps didn't return any result even though it queries PID 1 (which should be init), so my assumption turned out to be wrong. Do you have any idea why your system doesn't show the init process? I will look for other solutions.
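
For context, a minimal sketch of the call that fails, mirroring the ps command and the check_output usage shown in the traceback above (the function name here is illustrative, not gpustat's actual API):

import subprocess

def query_ps(pid_map):
    # pid_map: {pid: process info} collected from NVML / nvidia-smi.
    # '-p1' is prepended so that the PID list is never empty
    # (PID 1 should always exist).
    command = 'ps -o pid,user:16,comm -p1 -p ' + ','.join(map(str, pid_map.keys()))
    # check_output raises subprocess.CalledProcessError whenever ps exits
    # with a non-zero status, which is what surfaces as the error reported
    # at the top of this issue.
    return subprocess.check_output(command, shell=True).strip()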

wookayin added the bug label on Aug 16, 2017
feiwofeifeixiaowo (Author) commented

Hey, I figured out what leads to this bug and worked around it just now.
When I run the sudo nvidia-smi command on my machine, I get the correct output from the driver.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 38%   79C    P2    97W / 250W |    569MiB / 12205MiB |     49%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 28%   68C    P8    20W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 28%   68C    P8    19W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 59%   85C    P2   157W / 250W |  11664MiB / 12206MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1741    G   /usr/lib/xorg/Xorg                              20MiB |
|    0     19187    C   python3                                        271MiB |
|    0     19200    C   python3                                        271MiB |
|    3      5775    C   python2                                      11661MiB |
+-----------------------------------------------------------------------------+

This inspired me to run sudo gpustat, and the GPUs' state is returned correctly, as shown below. I hope this can help anyone with the same issue.

➜  ~ sudo gpustat
amax  Fri Aug 18 16:11:20 2017
[0] GeForce GTX TITAN X | 79'C,  44 % |   569 / 12205 MB | zjt(271M) zjt(271M)
[1] GeForce GTX TITAN X | 68'C,   0 % |     1 / 12206 MB |
[2] GeForce GTX TITAN X | 68'C,   0 % |     1 / 12206 MB |
[3] GeForce GTX TITAN X | 84'C,  29 % | 11664 / 12206 MB | test1(11661M)

I guess the reason is that the machine I am using is a multi-user GPU server, and I cannot see PIDs that belong to other users without the sudo command.
I hope this helps, @wookayin.
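
For anyone hitting this on a shared server, a small standard-library sketch (the helper name is hypothetical) to check whether a PID reported by nvidia-smi is visible from the current shell; if it prints False for a reported PID, gpustat's ps query will come back empty for that PID, as in the output above:

import os

def pid_visible(pid):
    # On Linux, a process the current user is allowed to see has a /proc entry.
    return os.path.isdir('/proc/{}'.format(pid))

print(pid_visible(14471))  # the PID from the original error message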

wookayin (Owner) commented

Hi @feiwofeifeixiaowo, thanks for the information! On a multi-user server, sometimes we can get other users' process information and sometimes we can't, but I don't know when or why. Can you please try gpustat again (without sudo) while the nvidia-smi daemon is running, e.g. after starting it with sudo nvidia-smi daemon?

Stonesjtu (Collaborator) commented Aug 19, 2017 via email

wookayin (Owner) commented Aug 19, 2017

> nvidia-smi daemon must be run with root privilege.

That's true. I was just wondering whether either the pynvml APIs or the nvidia-smi daemon could retrieve such information without root privileges.

feiwofeifeixiaowo (Author) commented

@wookayin, the returned messages for the two commands are shown below.

  • Running nvidia-smi without sudo:
➜  ~ nvidia-smi 
Sat Aug 19 11:22:44 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 39%   80C    P2   102W / 250W |   3324MiB / 12205MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 29%   69C    P8    20W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 29%   69C    P8    19W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 61%   84C    P2    96W / 250W |  11664MiB / 12206MiB |     30%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

  • Running sudo nvidia-smi:
➜  ~ sudo nvidia-smi
password for x: 
Sat Aug 19 11:24:02 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 39%   80C    P2   103W / 250W |   3324MiB / 12205MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 29%   69C    P8    20W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 29%   68C    P8    19W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 60%   84C    P2   104W / 250W |  11664MiB / 12206MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1741    G   /usr/lib/xorg/Xorg                              20MiB |
|    0      2171    C   python3                                        271MiB |
|    0      2269    C   python3                                        271MiB |
|    3      5775    C   python2                                      11661MiB |
+-----------------------------------------------------------------------------+

wookayin (Owner) commented Sep 5, 2017

@feiwofeifeixiaowo Could you please check whether the bug still happens on the previous versions you were trying, and whether it is now (hopefully) resolved since #20 was merged to master?

feiwofeifeixiaowo (Author) commented

Hello @wookayin, it seems that we still get the same issue.

➜ ~ gpustat -v

gpustat 0.4.0.dev0                                                                                            

➜ ~ gpustat

Error on querying NVIDIA devices. Use --debug flag for details

➜ ~ nvidia-smi

Wed Sep  6 08:17:11 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 22%   53C    P8    17W / 250W |  11661MiB / 12205MiB |      7%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 68%   85C    P2   160W / 250W |  11778MiB / 12206MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 78%   87C    P2   164W / 250W |  11743MiB / 12206MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 33%   73C    P8    22W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

➜ ~ sudo nvidia-smi

Wed Sep  6 08:17:20 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0      On |                  N/A |
| 22%   53C    P8    17W / 250W |  11661MiB / 12205MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 68%   86C    P2   178W / 250W |  11778MiB / 12206MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 78%   86C    P2   143W / 250W |  11743MiB / 12206MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 33%   73C    P8    22W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1741    G   /usr/lib/xorg/Xorg                              29MiB |
|    0     22919    C   /opt/anaconda/bin/python                     11628MiB |
|    1     27881    C   python2                                       2223MiB |
|    1     31912    C   /opt/anaconda/bin/python                      9548MiB |
|    2     32017    C   python2                                      11738MiB |
+-----------------------------------------------------------------------------+

➜ ~ gpustat

Error on querying NVIDIA devices. Use --debug flag for details

wookayin (Owner) commented Sep 6, 2017

Thanks for the update. I would like to have the same environment myself, but I don't think I do (maybe I have to mock and simulate it). Could you please provide stacktrace information by adding the --debug flag?

feiwofeifeixiaowo (Author) commented

Sorry about that. Here is the output:
gpustat --debug

Error on querying NVIDIA devices. Use --debug flag for details
Traceback (most recent call last):
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/_pslinux.py", line 1332, in wrapper
    return fun(self, *args, **kwargs)
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/_pslinux.py", line 1506, in create_time
    values = self._parse_stat_file()
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/_common.py", line 313, in wrapper
    return fun(self)
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/_pslinux.py", line 1367, in _parse_stat_file
    with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/_pslinux.py", line 190, in open_binary
    return open(fname, "rb", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/22919/stat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/__init__.py", line 408, in _init
    self.create_time()
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/__init__.py", line 734, in create_time
    self._create_time = self._proc.create_time()
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/_pslinux.py", line 1338, in wrapper
    raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=22919)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/gpustat.py", line 371, in print_gpustat
    gpu_stats = GPUStatCollection.new_query()
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/gpustat.py", line 284, in new_query
    gpu_info = get_gpu_info(handle)
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/gpustat.py", line 260, in get_gpu_info
    process = get_process_info(nv_process.pid)
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/gpustat.py", line 221, in get_process_info
    ps_process = psutil.Process(pid=pid)
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/__init__.py", line 381, in __init__
    self._init(pid)
  File "/home/xiyun/apps/anaconda3/lib/python3.6/site-packages/psutil/__init__.py", line 421, in _init
    raise NoSuchProcess(pid, None, msg)
psutil.NoSuchProcess: psutil.NoSuchProcess no process found with pid 22919

feiwofeifeixiaowo (Author) commented

Hi @wookayin, maybe this bug went away with the power outage a few hours ago. ^_^!

After my server shut down suddenly, I found that the gpustat command works well, as in the output below. I am not running any task that uses GPU resources, but it works well.

 ➜  ~ gpustat -v
gpustat 0.4.0.dev0
➜  ~ gpustat
amax  Thu Sep  7 16:21:48 2017
[0] GeForce GTX TITAN X | 46'C,   0 % |    27 / 12201 MB |
[1] GeForce GTX TITAN X | 55'C,   0 % | 11664 / 12206 MB | supermerry(11661M)
[2] GeForce GTX TITAN X | 87'C,  83 % | 11743 / 12206 MB | test1(11738M)
[3] GeForce GTX TITAN X | 84'C,  83 % | 11669 / 12206 MB | test1(11664M)

Stonesjtu (Collaborator) commented

@feiwofeifeixiaowo I googled this problem a lot, and it's likely caused by the broken context of some CUDA applications, so NVML returns PIDs that don't exist at all.

@wookayin we have to deal with this bug even in our py-nvml based version: psutil.Process(pid=non_exist_pid) raises an exception.

gpustat/gpustat.py, lines 218 to 230 at 895e1f8:

def get_process_info(pid):
    """Get the process information of specific pid"""
    process = {}
    ps_process = psutil.Process(pid=pid)
    process['username'] = ps_process.username()
    # cmdline returns full path; as in `ps -o comm`, get short cmdnames.
    process['command'] = os.path.basename(ps_process.cmdline()[0])
    # Bytes to MBytes
    process['gpu_memory_usage'] = int(nv_process.usedGpuMemory / 1024 / 1024)
    process['pid'] = nv_process.pid
    return process

def _decode(b):

FYI: https://devtalk.nvidia.com/default/topic/958159/11-gb-of-gpu-ram-used-and-no-process-listed-by-nvidia-smi/
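
For illustration, a minimal sketch of guarding against such stale PIDs around a psutil lookup like the one quoted above; this is only a sketch of the idea, not necessarily the exact change made in #23:

import psutil

def safe_process_info(pid):
    """Return basic info for pid, or None if NVML reported a stale PID."""
    try:
        ps_process = psutil.Process(pid=pid)
        return {
            'pid': pid,
            'username': ps_process.username(),
            'command': ps_process.name(),
        }
    except psutil.NoSuchProcess:
        # NVML can report PIDs whose CUDA context is gone (see the devtalk
        # link above); treat them as unknown instead of crashing the query.
        return None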

wookayin (Owner) commented Sep 11, 2017

@Stonesjtu You are absolutely correct. Thanks for the detailed information on why it happens; your PR #23 has been merged.

I think this issue can be closed now. @feiwofeifeixiaowo, can you please double-check this? Thanks, all!

wookayin added this to the 0.4 milestone on Nov 2, 2017
wookayin (Owner) commented Nov 2, 2017

I assume this is now fixed by v0.4.0. Please re-open this issue or open a new one if you have any further problems with it.

wookayin closed this as completed on Nov 2, 2017
Lmy0217 commented Dec 2, 2017

@wookayin @Stonesjtu
I also got this error on Ubuntu 17.04 with Python 2.7 when I run gpustat or sudo gpustat.

Error on querying NVIDIA devices. Use --debug flag for details
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gpustat.py", line 454, in print_gpustat
    gpu_stats = GPUStatCollection.new_query()
  File "/usr/local/lib/python2.7/dist-packages/gpustat.py", line 355, in new_query
    gpu_info = get_gpu_info(handle)
  File "/usr/local/lib/python2.7/dist-packages/gpustat.py", line 341, in get_gpu_info
    'enforced.power.limit': int(power_limit / 1000) if power is not None else None,
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

How can I fix it?

Stonesjtu (Collaborator) commented

I think this line of code, if power is not None, should be if power_limit is not None.
Can you change this and test whether the patch works?
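
For clarity, a small runnable sketch of the suggested change, using hypothetical values to reproduce the failing case from the traceback above (not necessarily the exact fix that ended up being committed):

# Hypothetical values: NVML returned a power reading but no enforced power limit.
power, power_limit = 15000, None

# before (the line from the traceback): guards the wrong variable, so the
# division still sees power_limit == None and raises the TypeError above
# entry = int(power_limit / 1000) if power is not None else None

# after (the suggestion): guard power_limit itself before dividing
entry = int(power_limit / 1000) if power_limit is not None else None
print(entry)  # -> None instead of a TypeError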

wookayin (Owner) commented Dec 2, 2017

@Lmy0217 You should have opened a new issue. Anyway, I have fixed it in 5c50d9f (current master). Sorry for the bug, and thanks for your report!

wookayin (Owner) commented Dec 2, 2017

Released as v0.4.1.

wookayin changed the title from "Error on calling nvidia-smi. Use --debug flag for details" to "Error on calling nvidia-smi: Command 'ps ...' returned non-zero exit status 1" on Dec 2, 2017