NVLink Throughput and Timeline throwing 500's and Errors in lab interface #28

Closed
ericdill opened this issue Oct 17, 2019 · 10 comments · Fixed by #133
Labels
bug Something isn't working

Comments

@ericdill

This is an awesome project. Thanks for the hard work here! It's really nice to have a dashboard for watching GPU resources (and is way better than opening up a terminal and running watch nvidia-smi 😀 )

I'll preface this issue by saying that it is mostly just some user feedback. Do with it what you will :). I'm happy to help debug further, but I have zero ability to actually write JLab extensions, so I can't help on the writing-code-to-help-fix side of things.

I'm brand new to using this extension (installed it like 15 mins ago) and was clicking around seeing what all the different dashboards do. When I open up the NVLink Throughput and NVLink Timeline dashboards, I immediately get stack traces in my jupyter server logs and a "500: Internal Server Error" in the jupyterlab widget. This is almost certainly because I'm not running on a multi-gpu system.

NVLink JupyterLab 500 in the opened panel:
[screenshot: Oct 17-11 31 05]

server logs from NVLink error:

ERROR:tornado.access:500 GET /NVLink-Throughput (127.0.0.1) 1.00ms
[E 11:27:00.965 LabApp] {
      "Host": "localhost:8888",
      "Connection": "keep-alive",
      "Upgrade-Insecure-Requests": "1",
      "Dnt": "1",
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36",
      "Sec-Fetch-Mode": "nested-navigate",
      "Sec-Fetch-User": "?1",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
      "Sec-Fetch-Site": "same-origin",
      "Accept-Encoding": "gzip, deflate, br",
      "Accept-Language": "en-US,en;q=0.9",
      "Cookie": "_xsrf=2|a800529e|3734e978fe7758a227e33d8ef289c566|1571322840; username-localhost-8888=\"2|1:0|10:1571326020|23:username-localhost-8888|44:MDg0NGY3NjY4OTMzNDdlMGI1MDQ5NmIwYjM0NmJjYTY=|969f92a7df270d40c831ddb30a0c7dfba20c443d37adaa36bea79c3fb78891a4\""
    }
[E 11:27:00.965 LabApp] 500 GET /nvdashboard/NVLink-Throughput (127.0.0.1) 18.59ms referer=None
ERROR:tornado.application:Uncaught exception GET /NVLink-Timeline (127.0.0.1)
HTTPServerRequest(protocol='http', host='localhost:8888', method='GET', uri='/NVLink-Timeline', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/web.py", line 1699, in _execute
    result = await result
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/views/doc_handler.py", line 55, in get
    session = yield self.get_session()
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/views/session_handler.py", line 77, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/server/contexts.py", line 215, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in nvlink_timeline
    for i in range(ngpus)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in <listcomp>
    for i in range(ngpus)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 392, in <listcomp>
    for j in range(nlinks)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/pynvml/nvml.py", line 1999, in nvmlDeviceGetNvLinkUtilizationCounter
    check_return(ret)
  File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Timeline (127.0.0.1) 38.24ms

Oh, interesting. Every time I switch between different tabs in jupyterlab, it seems like the dashboard needs to reconnect to the websocket. Sometimes this also throws an exception in the jupyter server logs. (Clearly the workaround is to have all of the dashboards exposed and not in tabs)

Websocket error:

[E 11:34:44.378 LabApp] Uncaught exception
    Traceback (most recent call last):
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 649, in _run_callback
        result = callback(*args, **kwargs)
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 1528, in on_message
        return self._on_message(message)
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 1534, in _on_message
        self._on_message_callback(message)
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/jupyter_server_proxy/handlers.py", line 247, in message_cb
        self.write_message(message, binary=isinstance(message, bytes))
      File "/home/ericdill/miniconda/envs/jupyter/lib/python3.7/site-packages/tornado/websocket.py", line 339, in write_message
        raise WebSocketClosedError()
    tornado.websocket.WebSocketClosedError
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
WARNING:bokeh.server.views.ws:Failed sending message as connection was closed
[I 11:34:45.794 LabApp] Trying to establish websocket connection to ws://localhost:36330/GPU-Memory/ws?bokeh-protocol-version=1.0&bokeh-session-id=zz8FYsgbMiW8RnVQwCKUJolWg5YcUxMupkjdHkU8nL4G
[I 11:34:45.847 LabApp] Websocket connection established to ws://localhost:36330/GPU-Memory/ws?bokeh-protocol-version=1.0&bokeh-session-id=zz8FYsgbMiW8RnVQwCKUJolWg5YcUxMupkjdHkU8nL4G

Current environment:

# packages in environment at /home/ericdill/miniconda/envs/rapidsai:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
arrow-cpp                 0.14.1           py37h6b969ab_1    conda-forge
backcall                  0.1.0                      py_0    conda-forge
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
brotli                    1.0.7             he1b5a44_1000    conda-forge
bzip2                     1.0.8                h516909a_0    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2019.6.16            hecc5488_0    conda-forge
certifi                   2019.6.16                py37_1    conda-forge
cffi                      1.12.3           py37h8022711_0    conda-forge
cudatoolkit               10.0.130                      0  
cudf                      0.9.0                    py37_0    rapidsai
cugraph                   0.9.0                    py37_0    rapidsai
cuml                      0.9.1           cuda10.0_py37_0    rapidsai
cython                    0.29.13          py37he1b5a44_0    conda-forge
decorator                 4.4.0                      py_0    conda-forge
dlpack                    0.2                  he1b5a44_0    conda-forge
double-conversion         3.1.5                he1b5a44_1    conda-forge
fastavro                  0.22.4           py37h516909a_0    conda-forge
gflags                    2.2.2             he1b5a44_1001    conda-forge
glog                      0.4.0                he1b5a44_1    conda-forge
grpc-cpp                  1.23.0               h18db393_0    conda-forge
icu                       64.2                 he1b5a44_1    conda-forge
ipykernel                 5.1.2            py37h5ca1d4c_0    conda-forge
ipython                   7.8.0            py37h5ca1d4c_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.15.1                   py37_0    conda-forge
jupyter_client            5.3.1                      py_0    conda-forge
jupyter_core              4.4.0                      py_0    conda-forge
libblas                   3.8.0               12_openblas    conda-forge
libcblas                  3.8.0               12_openblas    conda-forge
libcudf                   0.9.0                cuda10.0_0    rapidsai
libcugraph                0.9.0                cuda10.0_0    rapidsai
libcuml                   0.9.1                cuda10.0_0    rapidsai
libcumlprims              0.9.0                cuda10.0_0    nvidia
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
liblapack                 3.8.0               12_openblas    conda-forge
libnvstrings              0.9.0                cuda10.0_0    rapidsai
libopenblas               0.3.7                h6e990d7_1    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
librmm                    0.9.0                cuda10.0_0    rapidsai
libsodium                 1.0.17               h516909a_0    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0  
llvmlite                  0.29.0           py37hf484d3e_0    numba
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
nccl                      2.4.6.1              cuda10.0_0    nvidia
ncurses                   6.1               hf484d3e_1002    conda-forge
numba                     0.45.1          np116py37hf484d3e_0    numba
numpy                     1.16.4           py37h95a1406_0    conda-forge
nvstrings                 0.9.0                    py37_0    rapidsai
openssl                   1.1.1c               h516909a_0    conda-forge
pandas                    0.24.2           py37hb3f55d8_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.5.1                      py_0    conda-forge
pexpect                   4.7.0                    py37_0    conda-forge
pickleshare               0.7.5                 py37_1000    conda-forge
pip                       19.2.3                   py37_0    conda-forge
prompt_toolkit            2.0.9                      py_0    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
pyarrow                   0.14.1           py37h8b68381_0    conda-forge
pycparser                 2.19                     py37_1    conda-forge
pygments                  2.4.2                      py_0    conda-forge
python                    3.7.3                h33d41f4_1    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge
pytz                      2019.2                     py_0    conda-forge
pyzmq                     18.0.2           py37h1768529_2    conda-forge
re2                       2019.09.01           he1b5a44_0    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
rmm                       0.9.0                    py37_0    rapidsai
setuptools                41.2.0                   py37_0    conda-forge
six                       1.12.0                py37_1000    conda-forge
snappy                    1.1.7             he1b5a44_1002    conda-forge
sqlite                    3.29.0               hcee41ef_1    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.9             hed695b0_1002    conda-forge
tornado                   6.0.3            py37h516909a_0    conda-forge
traitlets                 4.3.2                 py37_1000    conda-forge
uriparser                 0.9.3                he1b5a44_1    conda-forge
wcwidth                   0.1.7                      py_1    conda-forge
wheel                     0.33.6                   py37_0    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zeromq                    4.3.2                he1b5a44_2    conda-forge
zlib                      1.2.11            h516909a_1005    conda-forge
zstd                      1.4.0                h3b9ef0a_0    conda-forge

Installed extension with pip install jupyterlab-nvdashboard and then jupyter labextension install jupyterlab-nvdashboard

@jacobtomlinson jacobtomlinson added the bug Something isn't working label Oct 17, 2019
@jacobtomlinson
Member

Thanks for raising this!

Yeah it looks like we should catch that pynvml.nvml.NVMLError_NotSupported: Not Supported error on systems without NVLink and show a sensible message to the user in the front end.
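
A minimal sketch of what catching that error could look like, assuming the counters are read in a nested comprehension over GPUs and links as the traceback suggests. The names `read_nvlink_counters`, `NotSupportedError`, and the two fake readers are hypothetical and only there to keep the example runnable without a GPU; in the dashboard the reader would wrap `pynvml.nvmlDeviceGetNvLinkUtilizationCounter` and the exception would be `pynvml.nvml.NVMLError_NotSupported`.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for pynvml.nvml.NVMLError_NotSupported."""


def read_nvlink_counters(read_counter, ngpus, nlinks, not_supported=NotSupportedError):
    """Read one counter per (gpu, link) pair.

    Returns (counters, message). When the reader reports that NVLink is not
    supported, counters is None and message is a string the front end can show
    instead of answering with a 500.
    """
    try:
        counters = [
            [read_counter(i, j) for j in range(nlinks)]
            for i in range(ngpus)
        ]
    except not_supported:
        return None, "NVLink metrics are not supported on this system."
    return counters, None


def fake_reader_without_nvlink(gpu, link):
    # Simulates a machine without NVLink: every read is rejected.
    raise NotSupportedError("Not Supported")


def fake_reader_with_nvlink(gpu, link):
    # Simulates a working counter with a deterministic value.
    return gpu * 10 + link


if __name__ == "__main__":
    print(read_nvlink_counters(fake_reader_without_nvlink, ngpus=1, nlinks=2))
    print(read_nvlink_counters(fake_reader_with_nvlink, ngpus=2, nlinks=2))
```

Returning a message rather than raising leaves it to the caller to decide how to render the "not supported" state in the panel.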

@jhgoebbert

I can confirm that this is an issue for everyone without NVLink.
This error fills the jupyterlab-logs multiple times a second:

ERROR:bokeh.util.tornado:Error thrown from periodic callback:
ERROR:bokeh.util.tornado:Traceback (most recent call last):
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/tornado-6.0.3-py3.6-linux-x86_64.egg/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/tornado-6.0.3-py3.6-linux-x86_64.egg/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/bokeh-1.3.4-py3.6.egg/bokeh/server/session.py", line 70, in _needs_document_lock_wrapper
    result = yield yield_for_all_futures(func(self, *args, **kwargs))
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/bokeh-1.3.4-py3.6.egg/bokeh/server/session.py", line 191, in with_document_locked
    return func(*args, **kwargs)
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/bokeh-1.3.4-py3.6.egg/bokeh/document/document.py", line 1127, in wrapper
    return doc._with_self_as_curdoc(invoke)
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/bokeh-1.3.4-py3.6.egg/bokeh/document/document.py", line 1113, in _with_self_as_curdoc
    return f()
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/bokeh-1.3.4-py3.6.egg/bokeh/document/document.py", line 1126, in invoke
    return f(*args, **kwargs)
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/jupyterlab_nvdashboard-0.2.0-py3.6.egg/jupyterlab_nvdashboard/apps/gpu.py", line 576, in cb
    gpu_handles[i], pynvml.NVML_PCIE_UTIL_TX_BYTES
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/pynvml-8.0.4-py3.6.egg/pynvml/nvml.py", line 1793, in nvmlDeviceGetPcieThroughput
    check_return(ret)
  File "/usr/local/software/jureca/Stages/Devel-2019a/software/Jupyter/2019a-gcccoremkl-8.3.0-2019.3.199-devel-Python-3.6.8/lib/python3.6/site-packages/pynvml-8.0.4-py3.6.egg/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
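
One way to keep a periodic callback from filling the logs like this is to stop polling a metric after the first "Not Supported" answer. The sketch below is only an illustration of that idea, not the dashboard's actual code: `GuardedMetric` and `NotSupportedError` are hypothetical names, and the reader passed in would in practice be something like a call to `pynvml.nvmlDeviceGetPcieThroughput` for one GPU handle.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for pynvml.nvml.NVMLError_NotSupported."""


class GuardedMetric:
    """Wrap a metric reader and disable it after the first unsupported error."""

    def __init__(self, reader, not_supported=NotSupportedError):
        self._reader = reader
        self._not_supported = not_supported
        self.supported = True

    def read(self, default=0):
        # Once the metric is known to be unsupported, skip the underlying call
        # so the periodic callback no longer raises on every tick.
        if not self.supported:
            return default
        try:
            return self._reader()
        except self._not_supported:
            self.supported = False
            return default


if __name__ == "__main__":
    attempts = []

    def unsupported_reader():
        attempts.append(1)
        raise NotSupportedError("Not Supported")

    metric = GuardedMetric(unsupported_reader)
    for _ in range(5):  # five simulated callback ticks
        metric.read()
    print(len(attempts), metric.supported)
```

Only the specific "not supported" exception is swallowed here; any other error still propagates, so genuine failures remain visible.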

@supertetelman

supertetelman commented Jul 21, 2020

I am running the latest version of this dashboard on a DGX Station with NVLink and I am still seeing this error. Is there a specific driver version I need?

It looks like I am seeing an issue with a different metric than the previous user: nvmlDeviceGetNvLinkUtilizationCounter. I remember seeing a related bug/change with some related metrics in the driver, so maybe this API has actually changed.

ERROR:tornado.application:Uncaught exception GET /NVLink-Throughput (127.0.0.1)
HTTPServerRequest(protocol='http', host='sae-npn-01:8899', method='GET', uri='/NVLink-Throughput', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
    result = await result
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/doc_handler.py", line 56, in get
    session = yield self.get_session()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/session_handler.py", line 79, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/contexts.py", line 222, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 233, in nvlink
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 233, in <listcomp>
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 230, in <listcomp>
    for j in range(nlinks)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 2006, in nvmlDeviceGetNvLinkUtilizationCounter
    check_return(ret)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Throughput (127.0.0.1) 43.87ms
ERROR:tornado.application:Uncaught exception GET /NVLink-Timeline (127.0.0.1)
HTTPServerRequest(protocol='http', host='sae-npn-01:8899', method='GET', uri='/NVLink-Timeline', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
    result = await result
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/doc_handler.py", line 56, in get
    session = yield self.get_session()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/views/session_handler.py", line 79, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/server/contexts.py", line 222, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in nvlink_timeline
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 395, in <listcomp>
    for i in range(ngpus)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 392, in <listcomp>
    for j in range(nlinks)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 2006, in nvmlDeviceGetNvLinkUtilizationCounter
    check_return(ret)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:tornado.access:500 GET /NVLink-Timeline (127.0.0.1) 38.54ms
...
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
...
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     0-39            N/A
GPU1    NV1      X      NV2     NV1     0-39            N/A
GPU2    NV1     NV2      X      NV1     0-39            N/A
GPU3    NV2     NV1     NV1      X      0-39            N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Running this Dockerfile:

# https://ngc.nvidia.com/catalog/containers/nvidia:rapidsai:rapidsai
FROM nvcr.io/nvidia/rapidsai/rapidsai:cuda10.2-runtime-ubuntu18.04

ENTRYPOINT ["/bin/sh"]
CMD ["-c", "/opt/conda/envs/rapids/bin/jupyter lab  --notebook-dir=/rapids --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

@rjzamora
Member

I plan to take some time to look into the various issues we are seeing for NVLink metrics in pynvml this week. If the NVML API has indeed changed, we should certainly update the wrappers. Overall, I do not think the "NVLink Throughput" dashboard has ever been very reliable, and I'd like to improve behavior on the backend.

Also, I definitely think it makes sense for us to inform the user more gracefully if/when NVLink metrics are not supported.

@davidrpugh

I have also observed this issue on a system with NVLINK after our system was upgraded with new drivers and CUDA Version.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K6000        On   | 00000000:3D:00.0 Off |                    0 |
| 27%   45C    P0    70W / 225W |    721MiB / 11441MiB |     45%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@davidrpugh

@rjzamora is there any way I can help on this? My HPC center recently upgraded to CUDA Version: 11.0 on the system and this seems to have broken the NVLINK dashboards for reasons that I don't understand. This functionality was super helpful for profiling and optimizing distributed deep learning workloads on our system. Is this an issue with pynvml?

@rjzamora
Member

rjzamora commented Dec 8, 2020

Sorry for letting this slip, and for missing your last comment @davidrpugh !

The only reason I haven't looked into this yet is that I don't personally have access to a machine with cuda 11 (so I have not personally observed the problem). This seems like more motivation for gpu CI :)

I will try harder to get access to something remotely, and I should be getting a local machine with 11 support in a few weeks. Others who already have access to 11 should feel free to look into this.

@davidrpugh

@rjzamora what is the best way to contribute to the solution here? Seems like the issue is with pynvml and not with jupyterlab_nvdashboard. Is it worth opening an issue there?

@rjzamora
Member

rjzamora commented Jan 8, 2021

> @rjzamora what is the best way to contribute to the solution here. Seems like the issue is with pynvml and not with jupyterlab_nvdashboard. Is it worth opening an issue there?

Yes - This is likely a pynvml issue. The best way to contribute is to investigate the relevant pynvml bindings in a cuda-11 environment. I have not personally prioritized this yet, because I do not have easy access to a system with cuda-11 (but I do have one being shipped to me soon).
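
For anyone with access to such a system, a small probe along these lines might help narrow down which bindings fail. It is a sketch under assumptions: `probe_bindings` and the fake callables are hypothetical, and on a real machine each entry would be a zero-argument lambda around one pynvml call (for example `nvmlDeviceGetNvLinkUtilizationCounter` or `nvmlDeviceGetPcieThroughput` with a concrete device handle).

```python
def probe_bindings(bindings):
    """Call each zero-argument callable and record whether it raised.

    `bindings` maps a human-readable name to a callable. The result maps the
    same names to "ok" or to "<ExceptionType>: <message>".
    """
    report = {}
    for name, call in bindings.items():
        try:
            call()
        except Exception as exc:  # broad on purpose: this is a diagnostic probe
            report[name] = f"{type(exc).__name__}: {exc}"
        else:
            report[name] = "ok"
    return report


def fake_working_binding():
    # Simulates a binding that returns a value normally.
    return 123


def fake_unsupported_binding():
    # Simulates a binding rejected by the driver.
    raise RuntimeError("Not Supported")


if __name__ == "__main__":
    results = probe_bindings({
        "working": fake_working_binding,
        "unsupported": fake_unsupported_binding,
    })
    for name, status in results.items():
        print(f"{name}: {status}")
```

Posting the resulting report together with the driver and CUDA versions would make it easier to compare behavior across systems.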

@pentschev
Member

#133 should hopefully fix this.
