A short note from a talk about training LLMs that I just attended. The speaker from Hugging Face mentioned https://github.com/NVIDIA/DCGM as a very useful tool for debugging and diagnosing problems. Maybe it would be useful for rapids doctor?
Just wanted to make a note of it. If you already know about it or have dismissed it, feel free to close the issue.