Adds a logging framework #671
Adds a logging framework with different levels, which the user can activate by using the environment variable
This PR also adds log statements at the appropriate levels. The intention is to let users (or developers) take a closer look at Horovod operations when they face a crash or hang. This should have no performance impact (except perhaps with trace turned on, and even that was barely noticeable in my experience). If you think the level of some statement should be adjusted, please let me know.
Printing logs and controlling the levels should be useful for developers as well when developing a new feature. This adds two macros for logging which developers can use.
The output is similar to how TensorFlow logs operations.
Sample output (the  refers to rank1)
There's also another environment variable, HOROVOD_LOG_HIDE_TIME=1, which disables printing of the time.
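For readers who want a feel for how an environment-controlled, leveled logger with stream-style macros can be put together, here is a minimal sketch. It is not the code from this PR. The macro names DEMO_LOG and DEMO_LOG_RANK, the DEMO_LOG_LEVEL variable, the level names, and the output format are all hypothetical. Only HOROVOD_LOG_HIDE_TIME is taken from the description above, and the way it is parsed here is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical severity levels, ordered from most to least verbose.
enum class LogLevel { TRACE = 0, DEBUG, INFO, WARNING, ERROR, FATAL };

// Map a case-insensitive level name to a LogLevel.
// Missing or unknown names fall back to WARNING (an assumed default).
inline LogLevel ParseLogLevel(const char* name) {
  if (name == nullptr) return LogLevel::WARNING;
  std::string s(name);
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (s == "trace") return LogLevel::TRACE;
  if (s == "debug") return LogLevel::DEBUG;
  if (s == "info") return LogLevel::INFO;
  if (s == "warning") return LogLevel::WARNING;
  if (s == "error") return LogLevel::ERROR;
  if (s == "fatal") return LogLevel::FATAL;
  return LogLevel::WARNING;
}

// Read the minimum level once. DEMO_LOG_LEVEL is a hypothetical variable name.
inline LogLevel MinLogLevel() {
  static const LogLevel level = ParseLogLevel(std::getenv("DEMO_LOG_LEVEL"));
  return level;
}

// Assumption: the timestamp is hidden only when the variable is exactly "1".
inline bool HideTime() {
  const char* v = std::getenv("HOROVOD_LOG_HIDE_TIME");
  return v != nullptr && std::string(v) == "1";
}

// Build one log line in a made-up format: "[<time>: <L> rank <r>] <msg>".
// A negative rank means "no rank" and omits the rank field.
inline std::string FormatLine(LogLevel level, int rank, const std::string& msg,
                              bool hide_time) {
  static const char kLevelChars[] = "TDIWEF";
  std::ostringstream out;
  out << "[";
  if (!hide_time) {
    char buf[32];
    std::time_t now = std::time(nullptr);
    // std::localtime is not thread-safe; acceptable for a sketch only.
    std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", std::localtime(&now));
    out << buf << ": ";
  }
  out << kLevelChars[static_cast<int>(level)];
  if (rank >= 0) out << " rank " << rank;
  out << "] " << msg;
  return out.str();
}

// Collects streamed text and writes one line to stderr when destroyed.
class LogMessage {
 public:
  LogMessage(LogLevel level, int rank) : level_(level), rank_(rank) {}
  ~LogMessage() {
    std::cerr << FormatLine(level_, rank_, buf_.str(), HideTime()) << std::endl;
  }
  std::ostringstream& stream() { return buf_; }

 private:
  LogLevel level_;
  int rank_;
  std::ostringstream buf_;
};

// Two hypothetical macros: one without a rank, one with a rank.
// The if/else shape skips evaluating the streamed operands when filtered out.
#define DEMO_LOG_RANK(level, rank)            \
  if (!(LogLevel::level >= MinLogLevel())) {  \
  } else                                      \
    LogMessage(LogLevel::level, (rank)).stream()
#define DEMO_LOG(level) DEMO_LOG_RANK(level, -1)
```

With this sketch, a call such as `DEMO_LOG_RANK(DEBUG, 1) << "starting allreduce";` prints only when DEMO_LOG_LEVEL is set to debug or trace. Because the message operands sit in the else branch, they are never evaluated when the level is filtered out, which is one way to keep the cost of disabled log statements low.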
Interestingly, I'm seeing this crash even with pip install horovod. The error showed up when I ran pytest -v; when I run pytest test_torch.py -v, this particular test doesn't show any error. The memory corruption seems to happen over time, but it consistently shows up at this test.