Add a fudge factor to ephemeral NCCL timeout increase #133722
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133722
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 8bd1758 with merge base 0063e56 (FLAKY - the following job failed but was likely due to flakiness present on trunk).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cache_event_time = time_ns()
if (time_taken_ns := compiled_graph._time_taken_ns) is not None:
    cache_info["time_saved_ns"] = time_taken_ns
if (time_saved_ns := compiled_graph._time_taken_ns) is not None:
Can we add a test for this logic? (I'm predicting some crazy bug in the future, where units get mixed up, and we end up with like 1e9 seconds passed to increase_timeout).
We already have a test for this logic, I added it this week:
def test_asymmetric_compilation_with_fx_cache(self):
It was not being executed in fbcode due to a missing TARGETS file, as we discussed earlier; that is also fixed.
if not torch.distributed.is_available() or not torch.distributed.is_initialized():
    return 0

increased_timeout_sec = int(time_saved_ns // 1e9)  # convert to seconds
nit: could also do time_saved_ns // 10**9 instead...
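For reference, a minimal sketch of the difference being discussed (plain Python behavior, not code from this PR): floor division by the float literal 1e9 yields a float, while 10**9 keeps the result an int, which is why the explicit int() cast is needed in the first form.

```python
time_saved_ns = 3_500_000_000  # example value: 3.5 seconds expressed in nanoseconds

# Floor division by a float literal produces a float, so a cast is needed.
assert time_saved_ns // 1e9 == 3.0
assert int(time_saved_ns // 1e9) == 3

# Floor division by an int keeps the result an int; no cast required.
assert time_saved_ns // 10**9 == 3
```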
def add_ephemeral_timeout_increase_for_distributed(time_saved_ns: int) -> int:
    """
    Ephemerally increases the NCCL timeout when compiling for a distributed job
    Returns amount of seconds increased
nanoseconds in, seconds out?
I was thinking that eventually we would use this function for all ephemeral timeouts, and we consistently use time.time_ns(); the output is only for logging.
If you think I should convert to seconds before passing to this function, I can do that, but it will result in less code sharing.
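To keep the units discussion concrete, here is a rough sketch of the shape such a helper could take under the convention described above (nanoseconds in, seconds out, with the return value used only for logging). The 2x fudge factor and the _apply_timeout_increase helper are illustrative placeholders, not the actual implementation in this PR.

```python
import torch


def _apply_timeout_increase(seconds: int) -> None:
    # Placeholder: the real change would bump the NCCL/process-group timeout
    # by `seconds`; here we only record the value.
    print(f"ephemeral timeout increase: {seconds}s")


def add_ephemeral_timeout_increase_for_distributed(time_saved_ns: int) -> int:
    """
    Ephemerally increases the NCCL timeout when compiling for a distributed job.
    Takes the estimated compile time saved in nanoseconds (time.time_ns() deltas)
    and returns the number of seconds the timeout was increased, for logging.
    """
    if not torch.distributed.is_available() or not torch.distributed.is_initialized():
        return 0

    increased_timeout_sec = int(time_saved_ns // 1e9)  # convert ns -> s

    # Hypothetical fudge factor: pad the increase so a slightly low estimate
    # of the saved time does not still trip the collective timeout.
    fudge_factor = 2
    increased_timeout_sec *= fudge_factor

    _apply_timeout_increase(increased_timeout_sec)
    return increased_timeout_sec
```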
compiled_graph.current_callable = new_callable


def add_ephemeral_timeout_increase_for_distributed(time_saved_ns: int) -> int:
Are you sure you want seconds as an int and not a float? (This applies to all the int casting below...)
time.time_ns() returns an int, and in order to write to Scuba we need them to be ints anyway.
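A small illustration of that reasoning (standard-library behavior only; the logging sink is assumed to expect integer columns): time.time_ns() deltas are already ints, so keeping the derived seconds value an int avoids mixing types downstream.

```python
import time

start_ns = time.time_ns()
# ... compilation work would happen here ...
time_saved_ns = time.time_ns() - start_ns  # an int, in nanoseconds

increased_timeout_sec = int(time_saved_ns // 1e9)  # cast needed because 1e9 is a float
assert isinstance(time_saved_ns, int)
assert isinstance(increased_timeout_sec, int)
```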
@oulgen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
Differential Revision: D61422431