
Add MIG support #102

Open
armandpicard opened this issue Jun 22, 2021 · 15 comments

@armandpicard

armandpicard commented Jun 22, 2021

When using gpustat with MIG (Multi-Instance GPU) in Kubernetes, we are not able to get metrics.

When running gpustat we get the main GPU name, but no memory metrics.
[screenshot: gpustat output showing the GPU name with no memory/utilization values]

This is due to the lack of permission on the root GPU.
We could get this information by listing the MIG devices of a MIG-enabled GPU. That would give more information, such as memory usage, but not compute utilization, since that is not yet implemented in NVML.
This leads to issues in Ray when getting GPU metrics.

A PR will follow

@wookayin
Owner

wookayin commented Jul 2, 2021

Thanks for reporting this. Would you be able to get any relevant information from the raw nvidia-smi command? If you have any good reference/documentation about it, that would also be helpful.

@wookayin
Owner

wookayin commented Jul 29, 2021

The issue is that the pynvml library we rely on is not aware of MIG. One quick but dirty workaround would be to parse nvidia-smi output, but that doesn't seem right.

So I'll have to re-implement the low-level library on our own, apart from pynvml, using NVIDIA's official NVML C API, or simply fork pynvml and add MIG support. Another pynvml binding also lacks MIG support for now. This will be non-trivial work, but I'll be happy to take it on.

@wookayin
Owner

wookayin commented Jul 29, 2021

I realized that there is now an official NVIDIA Python NVML binding, nvidia-ml-py: https://pypi.org/project/nvidia-ml-py/ (actively maintained), which has MIG support. So we'll have to switch to this pynvml library; then adding MIG support won't be difficult to implement (although I don't have any A100/MIG GPU available). Please stay tuned!
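
For reference, here's a minimal sketch (untested, since I don't have a MIG-enabled GPU at hand) of how the official bindings could enumerate MIG devices and read their memory usage. The calls are real nvidia-ml-py functions, but the GPU index 0 and the error handling are just illustrative assumptions:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# nvmlDeviceGetMigMode returns (current, pending); on GPUs that cannot do MIG
# at all it raises NVMLError_NotSupported, which is not handled here.
current, _pending = pynvml.nvmlDeviceGetMigMode(handle)
if current == pynvml.NVML_DEVICE_MIG_ENABLE:
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
        except pynvml.NVMLError:
            continue  # unused MIG slot
        mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
        print(pynvml.nvmlDeviceGetName(mig), mem.used, mem.total)

pynvml.nvmlShutdown()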

@armandpicard
Author

I've tried to use this library here: https://github.com/instadeepai/gpustat/tree/fix-add-MIG-support, but some functions do not support MIG yet, such as nvmlDeviceGetUtilizationRates (https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g540824faa6cef45500e0d1dc2f50b321).

@wookayin
Owner

wookayin commented Jul 29, 2021

The API documentation and user guide say

On MIG-enabled GPUs, querying device utilization rates is not currently supported.

It looks like DCGM is the only way to go. Can you try some command-line tools (nvidia-smi or dcgmi) to see whether it's possible to get the numbers you want, such as GPU utilization?
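
For example (I haven't verified this on an actual MIG setup, and it needs a reasonably recent DCGM), dcgmi dmon can stream profiling metrics that NVML itself doesn't expose; my understanding is that field 1001 is graphics engine activity and 1005 is DRAM activity, and -i selects which GPU(s) to watch:

# List the GPU instances configured on the MIG-enabled GPU
$ nvidia-smi mig -lgi

# Stream utilization-like profiling metrics via DCGM
# (field 1001 = graphics engine activity, 1005 = DRAM activity)
$ dcgmi dmon -e 1001,1005 -i 0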


@XuehaiPan
Contributor

XuehaiPan commented Aug 4, 2021

The official implementation of the NVML Python bindings (nvidia-ml-py) has included MIG support since nvidia-ml-py>=11.450.129. But this causes NVMLError_FunctionNotFound at _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2") on old NVIDIA drivers:

# Added in 2.285
def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);  # no v1 version now!!! incompatible with old drivers

And gpustat will not be able to gather process info correctly on old machines if we blindly bump the nvidia-ml-py version in gpustat's requirements.

If we want to keep the simple install instruction pip3 install gpustat, it is hard to determine which version of nvidia-ml-py should be installed as a dependency before gpustat itself is installed.

@wookayin
Owner

wookayin commented Aug 4, 2021

@XuehaiPan Thanks; we will be using the official Python bindings, which I have already implemented and will push quite soon. I wasn't aware that there is such a backward incompatibility around nvmlDeviceGetComputeRunningProcesses_v2 (see #105 as well).

So we must check this carefully with "old" GPU cards or "old" NVIDIA drivers; I wonder what the exact setup is that breaks. We might also need to work around this. One possible way is to monkey-patch _nvmlGetFunctionPointer_cache; I will experiment a bit more and keep you posted.
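
Roughly, the monkey-patching idea would be something like the sketch below (hypothetical, not a tested fix): if the driver doesn't export the _v2 symbol, make pynvml's cached lookup of the _v2 name resolve to the legacy v1 symbol. One caveat is that the process-info struct layout also changed between v1 and v2, so this alone may not be enough:

import pynvml

def patch_missing_v2_symbol():
    # Hypothetical workaround for old drivers that lack the _v2 entry point.
    pynvml.nvmlInit()
    try:
        pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    except pynvml.NVMLError_FunctionNotFound:
        # Reuse the v1 entry point under the v2 name in pynvml's cache.
        v1 = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
        pynvml._nvmlGetFunctionPointer_cache["nvmlDeviceGetComputeRunningProcesses_v2"] = v1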

@XuehaiPan
Contributor

XuehaiPan commented Aug 4, 2021

So we must check this carefully with "old" GPU cards or "old" NVIDIA drivers; I wonder what the exact setup is that breaks. We might also need to work around this.

On Ubuntu 16.04 LTS, the highest supported NVIDIA driver version is nvidia-430:

$ cat /etc/*-release  
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

$ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-430/now 430.64-0ubuntu0~gpu16.04.2 amd64 [installed,upgradable to: 430.64-0ubuntu0~gpu16.04.6]
nvidia-cuda-doc/xenial,xenial,now 7.5.18-0ubuntu1 all [installed]
nvidia-cuda-gdb/xenial,now 7.5.18-0ubuntu1 amd64 [installed]
nvidia-opencl-dev/xenial,now 7.5.18-0ubuntu1 amd64 [installed]
nvidia-opencl-icd-430/now 430.64-0ubuntu0~gpu16.04.2 amd64 [installed,upgradable to: 430.64-0ubuntu0~gpu16.04.6]
nvidia-prime/xenial,now 0.8.2 amd64 [installed,automatic]
nvidia-settings/xenial,now 361.42-0ubuntu1 amd64 [installed,upgradable to: 418.56-0ubuntu0~gpu16.04.1]

Although Ubuntu 16.04 LTS reached the end of its five-year LTS window on April 30th, 2021, it is still widely used in industry and research laboratories due to poor IT services :(.

nvidia-ml-py==11.450.51 works fine on Ubuntu 16.04, but it does not have binding functions for MIG support.

$ pip3 install ipython nvidia-ml-py==11.450.51

$ ipython3                             
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: nvmlSystemGetDriverVersion()
Out[3]: b'430.64'

In [4]: handle = nvmlDeviceGetHandleByIndex(0)

In [5]: nvmlDeviceGetComputeRunningProcesses(handle)
Out[5]: []

In [6]: nvmlDeviceGetGraphicsRunningProcesses(handle)
Out[6]: [<pynvml.nvmlFriendlyObject at 0x7fb2a4d1c400>]

In [7]: list(map(str, nvmlDeviceGetGraphicsRunningProcesses(handle)))
Out[7]: ["{'pid': 1876, 'usedGpuMemory': 17580032}"]

nvidia-ml-py>=11.450.129 adds binding functions for MIG support, but it raises errors when querying the running process PIDs with old drivers. Users would have to downgrade nvidia-ml-py manually to handle this, or upgrade the NVIDIA driver (privileges required).

$ pip3 install ipython nvidia-ml-py==11.450.129

$ ipython3
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: nvmlSystemGetDriverVersion()
Out[3]: b'430.64'

In [4]: handle = nvmlDeviceGetHandleByIndex(0)

In [5]: nvmlDeviceGetComputeRunningProcesses(handle)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in _nvmlGetFunctionPointer(name)
    719         try:
--> 720             _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
    721             return _nvmlGetFunctionPointer_cache[name]

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/ctypes/__init__.py in __getattr__(self, name)
    386             raise AttributeError(name)
--> 387         func = self.__getitem__(name)
    388         setattr(self, name, func)

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/ctypes/__init__.py in __getitem__(self, name_or_ordinal)
    391     def __getitem__(self, name_or_ordinal):
--> 392         func = self._FuncPtr((name_or_ordinal, self))
    393         if not isinstance(name_or_ordinal, int):

AttributeError: /usr/lib/nvidia-430/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

NVMLError_FunctionNotFound                Traceback (most recent call last)
<ipython-input-4-ef8a5a47bcb8> in <module>
----> 1 nvmlDeviceGetComputeRunningProcesses(handle)

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in nvmlDeviceGetComputeRunningProcesses(handle)
   2093 
   2094 def nvmlDeviceGetComputeRunningProcesses(handle):
-> 2095     return nvmlDeviceGetComputeRunningProcesses_v2(handle);
   2096 
   2097 def nvmlDeviceGetGraphicsRunningProcesses_v2(handle):

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in nvmlDeviceGetComputeRunningProcesses_v2(handle)
   2061     # first call to get the size
   2062     c_count = c_uint(0)
-> 2063     fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
   2064     ret = fn(handle, byref(c_count), None)
   2065 

~/.linuxbrew/Cellar/python@3.9/3.9.6/lib/python3.9/site-packages/pynvml.py in _nvmlGetFunctionPointer(name)
    721             return _nvmlGetFunctionPointer_cache[name]
    722         except AttributeError:
--> 723             raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
    724     finally:
    725         # lock is always freed

NVMLError_FunctionNotFound: Function Not Found

@wookayin
Owner

wookayin commented Aug 4, 2021

So I think it's the driver version that is old, not the graphics card. BTW, it is recommended to install NVIDIA drivers from the official binaries (although gpustat can still support such legacy drivers).

With the old NVIDIA driver, however, could you try the following?

v1 = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
v2 = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")

I guess v1 will still work, but v2 will raise as you already showed in the stack trace. In my environment, with a recent version of the NVIDIA driver, both work.

@XuehaiPan
Contributor

XuehaiPan commented Aug 4, 2021

I guess v1 will still work, but v2 will raise as you already showed in the stack trace. In my environment, with a recent version of the NVIDIA driver, both work.

| v1 entry point | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | works, but without CI ID / GI ID | works, but without CI ID / GI ID |
| nvidia-ml-py>=11.450.129 | no exception in Python, but wrong results (subscript out of range in the C library) | no exception in Python, but wrong results (subscript out of range in the C library) |

| v2 entry point | NVIDIA 430.64 | NVIDIA 470.57.02 |
| --- | --- | --- |
| nvidia-ml-py==11.450.51 | function not found | no exception in Python, but wrong results (subscript out of range in the C library) |
| nvidia-ml-py>=11.450.129 | function not found | works, with correct CI ID / GI ID |

@wookayin
Owner

wookayin commented Aug 5, 2021

@XuehaiPan Can you elaborate on the meaning of CI ID / GI ID? (Got it: Compute Instance (CI) ID and GPU Instance (GI) ID.)

So I think falling back to the v1 function for old drivers will be the best option, so that obtaining process information works in either case.
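
A rough sketch of that fallback (hypothetical code, not what gpustat will ship verbatim): call the legacy nvmlDeviceGetComputeRunningProcesses symbol together with the legacy struct layout (pid and usedGpuMemory only), since, per the table above, calling the old entry point with the new struct yields wrong results:

from ctypes import Structure, byref, c_uint, c_ulonglong

import pynvml

class c_nvmlProcessInfo_v1_t(Structure):
    # Legacy layout: no gpuInstanceId / computeInstanceId fields.
    _fields_ = [("pid", c_uint), ("usedGpuMemory", c_ulonglong)]

def get_compute_processes_v1(handle):
    fn = pynvml._nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses")
    count = c_uint(0)
    ret = fn(handle, byref(count), None)  # first call only queries the count
    if ret == pynvml.NVML_SUCCESS:
        return []  # no running compute processes
    procs = (c_nvmlProcessInfo_v1_t * count.value)()  # allocate a v1-layout array
    pynvml._nvmlCheckReturn(fn(handle, byref(count), procs))
    return [(p.pid, p.usedGpuMemory) for p in procs[:count.value]]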


@wookayin wookayin modified the milestones: 1.0, 1.1 Sep 4, 2022
@wookayin wookayin modified the milestones: 1.1, 1.2 Mar 2, 2023
@starry91

starry91 commented Mar 9, 2023

The official implementation of the NVML Python bindings (nvidia-ml-py) has included MIG support since nvidia-ml-py>=11.450.129

@XuehaiPan @wookayin Correct me if I am wrong, but as I understand it, the support here only covers memory usage and MIG-profile-related info, not the utilization stats. Last I checked, DCGM was the only way to get utilization stats for MIG-enabled devices.

@diricxbart

Hello @wookayin,
do you have an update on this topic?
I will need MIG support in the coming weeks and would like to know whether we should start looking into other solutions...
I could assist with validation if needed, using an NVIDIA A100.
