New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a logging framework #671

Merged
merged 19 commits into from Jan 4, 2019

Conversation

Projects
None yet
3 participants
@rahul003
Copy link
Contributor

rahul003 commented Dec 1, 2018

Adds a logging framework with different levels, which the user can activate by using the environment variable HOROVOD_LOG_LEVEL. It takes the values as follows:
fatal, error, warning : Warnings like stalled tensors, etc (default)
info: Logs the configuration and initialization and shutdown,etc
debug: Logs how many tensors are being fused, what size, etc
trace : Logs names of tensors being reduced in addition to the above

This PR also adds some logs at the appropriate level. The intention is to allow users (or developers) to take a closer look at Horovod operations when they face a crash or hang. There is no performance effect to this (unless you turn on trace maybe, but even that was barely noticeable in my experience). If you think the level of some statement should be adjusted, please let me know.

Printing logs and controlling the levels should be useful for developers as well when developing a new feature. This adds two macros for logging which developers can use.

  • LOG(INFO) << "message";
  • LOG(INFO, rank) << "message";

The output is similar to how TensorFlow logs operations.

Sample output (the [1] refers to rank1)

[2018-12-01 00:12:30.893028: I horovod/common/operations.cc:1498] Started Horovod with 8 processes
[2018-12-01 00:12:30.893390: I horovod/common/operations.cc:1629] [1]: Horovod Initialized
[2018-12-01 00:12:30.893397: I horovod/common/operations.cc:1629] [2]: Horovod Initialized
[2018-12-01 00:12:30.893399: I horovod/common/operations.cc:1629] [3]: Horovod Initialized
[2018-12-01 00:12:30.893399: I horovod/common/operations.cc:1629] [4]: Horovod Initialized
[2018-12-01 00:12:30.893393: I horovod/common/operations.cc:1629] [5]: Horovod Initialized
[2018-12-01 00:12:30.893397: I horovod/common/operations.cc:1629] [6]: Horovod Initialized
[2018-12-01 00:12:30.893406: I horovod/common/operations.cc:1629] [7]: Horovod Initialized
[2018-12-01 00:12:30.893393: I horovod/common/operations.cc:1619] Using fusion threshold of 16777216
[2018-12-01 00:12:30.893424: I horovod/common/operations.cc:1629] [0]: Horovod Initialized
[2018-12-01 00:11:18.020152: D horovod/common/operations.cc:1758] [6]: Sent 5 messages
[2018-12-01 00:11:18.024032: D horovod/common/operations.cc:1758] [4]: Sent 10 messages
[2018-12-01 00:11:18.024160: D horovod/common/operations.cc:1758] [2]: Sent 5 messages
[2018-12-01 00:11:18.024365: D horovod/common/operations.cc:1887] Created response of size 12410784
[2018-12-01 00:11:18.024402: D horovod/common/operations.cc:1887] Created response of size 9441280

There's also another environment variable which disables the printing of the time. (HOROVOD_LOG_HIDE_TIME=1)
The output would then look like:

[I horovod/common/operations.cc:1498] Started Horovod with 8 processes
[I horovod/common/operations.cc:1619] Using fusion threshold of 16777216
[I horovod/common/operations.cc:1629] [0]: Horovod Initialized
[I horovod/common/operations.cc:1629] [1]: Horovod Initialized
[I horovod/common/operations.cc:1629] [2]: Horovod Initialized
[I horovod/common/operations.cc:1629] [3]: Horovod Initialized
[I horovod/common/operations.cc:1629] [4]: Horovod Initialized
[I horovod/common/operations.cc:1629] [5]: Horovod Initialized
[I horovod/common/operations.cc:1629] [6]: Horovod Initialized
[I horovod/common/operations.cc:1629] [7]: Horovod Initialized

@alsrgv alsrgv self-requested a review Dec 11, 2018

@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Dec 11, 2018

Thanks, @rahul003 - this is great! Could you rebase and annotate new std::cerr logs added in #615 as well?

@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Dec 18, 2018

@rahul003, bump - would be great to merge this change ASAP.

@alsrgv alsrgv referenced this pull request Dec 18, 2018

Open

MLSL integration #573

@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Dec 19, 2018

Sorry for the delay, was traveling. On it

@rahul003 rahul003 force-pushed the rahul003:logging branch from 96c7da6 to b064526 Dec 19, 2018

@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Dec 19, 2018

@rahul003, can you fix unit tests?

@tgaddair
Copy link
Collaborator

tgaddair left a comment

LGTM, thanks!

@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Dec 19, 2018

@alsrgv I've rebased and updated some logging statements. I've fixed the issue causing crash in the builds. I'll keep a watch and ensure they go through.
Thanks @tgaddair!

@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Dec 20, 2018

Any idea what might be up with the tests failing for torch 0.4.x with python 3.x. It doesn't feel like that might be related to this PR, is it?

@alsrgv
Copy link
Collaborator

alsrgv left a comment

I re-ran the tests, but they all failed persistently with segfault. Can you try making a docker with failing configuration and debug the segfault?

Show resolved Hide resolved horovod/common/operations.cc Outdated
Show resolved Hide resolved horovod/common/operations.cc Outdated
@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Dec 20, 2018

Update: I have NOW reproduced the segfault locally and am debugging it

@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Dec 20, 2018

Interestingly, I'm seeing this crash even with pip install horovod. Also, this error showed up when I did pytest -v. When I do pytest test_torch.py -v, then this particular test doesn't show any error. The memory corruption seems to happen over time, but consistently at this test.

Crash message

@alsrgv
Copy link
Collaborator

alsrgv left a comment

@rahul003, thanks for the investigation. Do you get a core dump? Can you get thread apply all bt output from gdb?

Also, I did read code again and left a few comments.

Show resolved Hide resolved horovod/common/logging.h Outdated
Show resolved Hide resolved horovod/common/logging.cc Outdated
Show resolved Hide resolved horovod/common/logging.cc Outdated
Show resolved Hide resolved horovod/common/mpi_message.cc Outdated
Show resolved Hide resolved horovod/common/logging.cc Outdated
Show resolved Hide resolved horovod/common/logging.cc
Show resolved Hide resolved horovod/common/logging.cc Outdated
Show resolved Hide resolved horovod/common/logging.cc Outdated
Show resolved Hide resolved horovod/common/operations.cc
@alsrgv
Copy link
Collaborator

alsrgv left a comment

Could you rebase on latest changes I just pushed? It may help with debugging.

@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Dec 27, 2018

Thanks created a separate issue for the crash here #724
Trying to debug that using master. Will update this PR based on your comments

@rahul003 rahul003 force-pushed the rahul003:logging branch from f2310e0 to b6db77a Dec 27, 2018

Show resolved Hide resolved horovod/common/operations.cc
Show resolved Hide resolved horovod/common/mpi_message.cc Outdated
Show resolved Hide resolved horovod/common/logging.cc
Show resolved Hide resolved horovod/common/logging.cc Outdated
@alsrgv

This comment has been minimized.

Copy link
Collaborator

alsrgv commented Jan 3, 2019

@rahul003, I've landed #730, could you try rebasing this branch?

@rahul003 rahul003 force-pushed the rahul003:logging branch from cf2e007 to 8e43148 Jan 3, 2019

@alsrgv

alsrgv approved these changes Jan 3, 2019

Copy link
Collaborator

alsrgv left a comment

LGTM! Please let me know if it's fully ready for merge.

@rahul003

This comment has been minimized.

Copy link
Contributor

rahul003 commented Jan 4, 2019

Yeah, I think it's good to go!

@alsrgv alsrgv merged commit 6a933cd into uber:master Jan 4, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
license/cla Contributor License Agreement is signed.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment