TORCH_DISTRIBUTED_DEBUG not effective #70667
Comments
I think you might have to compile with `USE_GLOG` enabled.
I'm having this same issue, but somehow getting even less information. Running the exact code above, no debug info is printed.
Version info:
If you set `NCCL_DEBUG=INFO`, you should get exactly the same output as mine.
Thanks for pointing this out! It indeed looks like USE_GLOG is set to off in the prebuilt PyTorch binaries, and thus calls like `LOG(INFO)` are no-ops.
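One quick way to confirm this from Python is to inspect the compile-time configuration of the installed binary (a sketch; the exact flags listed vary by build):

```python
import torch

# Prints the build settings baked into this binary; prebuilt wheels
# are expected to list USE_GLOG=OFF among them.
print(torch.__config__.show())
```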
@malfet - is there any context as to why the prebuilt PyTorch binaries have USE_GLOG off, and thus LOG calls disabled with no way to turn them on? Is there a workaround, or could we possibly consider building the prebuilt binaries with USE_GLOG on? In the short term, we can resolve this by logging to stderr, although writing these logs with LOG(INFO) would be preferable.
cc @cbalioglu
See #68226 as well. |
GLOG was disabled by #16789; you can probably ask @soumith for the reason why, but my guess is that it introduces lots of runtime dependencies (glog, gflags, and protobuf are all notoriously hard to integrate in a way that does not conflict with other installations of those libraries on the system). It would be a great BE effort to unify PyTorch's multiple logging primitives, such as base c10 logging and related systems. Please note that debug-level logging is likely compiled out of release builds for performance reasons, as even a dummy function call is expensive in code that is called billions of times.
It looks like the easiest solution for line 127 (in 9d47652) is to adjust what is said here: pytorch/torch/csrc/distributed/c10d/logger.cpp, lines 380 to 382 (in 9d47652).
@cbalioglu has fixed this with #71746 |
Summary: Fixes #70667. `TORCH_CPP_LOG_LEVEL=INFO` is needed for `TORCH_DISTRIBUTED_DEBUG` to be effective. For reference, #71746 introduced the environment variable `TORCH_CPP_LOG_LEVEL` and #73361 documented it. Pull Request resolved: #76625. Approved by: https://github.com/rohan-varma. Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/9c902f4749a7b023cb5a6f97b943ba65796418c3. Reviewed By: malfet. Differential Revision: D36134083
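In practice, the fix amounts to setting both environment variables together. A minimal sketch, assuming `TORCH_CPP_LOG_LEVEL` is read when torch's C++ core initializes (hence setting it before the import):

```python
import os

# Assumption: both variables need to be in the environment before
# torch loads, since the C++ log level is read at initialization.
# TORCH_CPP_LOG_LEVEL=INFO surfaces the C++ LOG(INFO) messages,
# which is what lets TORCH_DISTRIBUTED_DEBUG take effect.
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
```

The shell equivalent would be `TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL python train.py` (with `train.py` as a placeholder script name).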
🐛 Describe the bug
I'm testing `TORCH_DISTRIBUTED_DEBUG` according to this document: https://pytorch.org/docs/master/distributed.html#debugging-torch-distributed-applications
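(The original code block did not survive here; the following is a sketch reconstructed from the debugging example in the linked documentation, assuming two GPUs and the NCCL backend.)

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


class TwoLinLayerNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(10, 10, bias=False)
        self.b = torch.nn.Linear(10, 1, bias=False)

    def forward(self, x):
        return self.a(x), self.b(x)


def worker(rank):
    # Each spawned process joins the two-process NCCL group.
    dist.init_process_group("nccl", rank=rank, world_size=2)
    torch.cuda.set_device(rank)
    print("init model")
    model = TwoLinLayerNet().cuda()
    print("init ddp")
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    inp = torch.randn(10, 10).cuda()
    print("train")
    for _ in range(20):
        a, b = ddp_model(inp)
        (a + b).sum().backward()


if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"
    # The setting under test: should make DDP emit detailed debug logs.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    mp.spawn(worker, nprocs=2, args=())
```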
The code above is taken from that document unchanged, but the output contains no DDP debug information.
Do I need to configure USE_GLOG at compile time, or specify a CAFFE2_LOG_THRESHOLD of no less than INFO, in order to print these messages? If so, I suggest refining the documentation here.
The PyTorch build used:
Versions
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang