NVLink Throughput and Timeline throwing 500's and Errors in lab interface #28
Comments
Thanks for raising this! Yeah, it looks like we should catch that.
I can confirm that this is an issue for everyone without NVLink.
I am running the latest version of this dashboard on a DGX Station with NVLink, and I am still seeing this error. Is there a specific driver version I need? It looks like I am seeing an issue with a different metric than the previous user.

Running this Dockerfile:
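In case the driver question is relevant, here is a quick way to dump the driver and NVML versions pynvml sees (a minimal sketch, assuming pynvml is importable in the environment that hits the error):

```python
# Print the driver and NVML versions that pynvml sees, to help
# correlate these NVLink errors with a driver/CUDA upgrade.
import pynvml

pynvml.nvmlInit()
try:
    print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
    print("NVML version:", pynvml.nvmlSystemGetNVMLVersion())
finally:
    pynvml.nvmlShutdown()
```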
I plan to take some time to look into the various issues we are seeing for NVLink metrics in pynvml this week. If the NVML API has indeed changed, we should certainly update the wrappers. Overall, I do not think the "NVLink Throughput" dashboard has ever been very reliable, and I'd like to improve behavior on the backend. Also, I definitely think it makes sense for us to inform the user more gracefully if/when NVLink metrics are not supported.
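As a rough sketch of the kind of guard I mean (not the actual dashboard code; `nvlink_supported` is a hypothetical helper, and it assumes `pynvml.nvmlInit()` has already been called):

```python
import pynvml

def nvlink_supported(device_index=0):
    # Probe the first NVLink link on one device. NVML raises
    # NVMLError (e.g. "Not Supported") on GPUs without NVLink,
    # which is what currently surfaces as a 500.
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        pynvml.nvmlDeviceGetNvLinkState(handle, 0)
        return True
    except pynvml.NVMLError:
        return False
```

The NVLink routes could check something like this once and render an "NVLink not supported on this system" message instead of letting the exception bubble up as a 500.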
I have also observed this issue on a system with NVLink, after our system was upgraded with new drivers and a new CUDA version.
@rjzamora is there any way I can help on this? My HPC center recently upgraded the system to CUDA 11.0, and this seems to have broken the NVLink dashboards for reasons that I don't understand. This functionality was super helpful for profiling and optimizing distributed deep learning workloads on our system. Is this an issue with pynvml?
Sorry for letting this slip, and for missing your last comment @davidrpugh! The only reason I haven't looked into this yet is that I don't personally have access to a machine with CUDA 11 (so I have not observed the problem myself). This seems like more motivation for GPU CI :) I will try harder to get access to something remotely, and I should be getting a local machine with CUDA 11 support in a few weeks. Others who already have access to CUDA 11 should feel free to look into this.
@rjzamora what is the best way to contribute to the solution here? Seems like the issue is with pynvml.
Yes - this is likely a pynvml issue. The best way to contribute is to investigate the relevant pynvml bindings in a CUDA 11 environment. I have not personally prioritized this yet, because I do not have easy access to a system with CUDA 11 (but I do have one being shipped to me soon).
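For anyone who does have a CUDA 11 machine handy, something along these lines should narrow down which bindings fail (a sketch, assuming pynvml is installed; the counter index here is illustrative):

```python
# Probe every device/link combination for the NVLink bindings the
# dashboard relies on, and print exactly which calls raise.
import pynvml

pynvml.nvmlInit()
try:
    for dev in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(dev)
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
                util = pynvml.nvmlDeviceGetNvLinkUtilizationCounter(
                    handle, link, 0)  # counter 0
                print(f"dev {dev} link {link}: state={state} util={util}")
            except pynvml.NVMLError as e:
                print(f"dev {dev} link {link}: raised {e}")
finally:
    pynvml.nvmlShutdown()
```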
#133 should hopefully fix this.
This is an awesome project. Thanks for the hard work here! It's really nice to have a dashboard for watching GPU resources (and it's way better than opening up a terminal and running `watch nvidia-smi` 😄).

I'll preface this issue by saying that this is mostly just some user feedback. Do with it what you will :). I'm happy to help debug further, but I have zero ability to actually write JLab extensions, so I can't help on the writing-code-to-help-fix side of things.
I'm brand new to using this extension (installed it like 15 mins ago) and was clicking around seeing what all the different dashboards do. When I open the NVLink Throughput and NVLink Timeline dashboards, I immediately get stack traces in my Jupyter server logs and a "500: Internal Server Error" in the JupyterLab widget. This is almost certainly because I'm not running on a multi-GPU system.
NVLink JupyterLab 500 in the opened panel:

Server logs from the NVLink error:
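For what it's worth, the failure seems easy to reproduce outside of Jupyter (a sketch, assuming pynvml is installed; on a GPU without NVLink I'd expect the state query to raise rather than return data):

```python
# Minimal reproduction outside Jupyter: on a GPU without NVLink,
# the link-state query raises NVMLError instead of returning data.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        print("NVLink state:", pynvml.nvmlDeviceGetNvLinkState(handle, 0))
    except pynvml.NVMLError as e:
        print("NVLink query failed:", e)
finally:
    pynvml.nvmlShutdown()
```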
Oh, interesting. Every time I switch between tabs in JupyterLab, the dashboard seems to need to reconnect to the websocket. Sometimes this also throws an exception in the Jupyter server logs. (Clearly the workaround is to keep all of the dashboards exposed rather than in tabs.)
Websocket error:
Current environment:
Installed the extension with `pip install jupyterlab-nvdashboard` and then `jupyter labextension install jupyterlab-nvdashboard`.