
Conversation

@oulgen (Contributor) commented Aug 16, 2024

pytorch-bot bot commented Aug 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133722

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8bd1758 with merge base 0063e56:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

oulgen added a commit that referenced this pull request Aug 16, 2024
ghstack-source-id: 6fa9680
Pull Request resolved: #133722
@oulgen added the ciflow/trunk and topic: not user facing labels Aug 16, 2024
cache_event_time = time_ns()
if (time_taken_ns := compiled_graph._time_taken_ns) is not None:
    cache_info["time_saved_ns"] = time_taken_ns
if (time_saved_ns := compiled_graph._time_taken_ns) is not None:
Contributor commented:

Can we add a test for this logic? (I'm predicting some crazy bug in the future, where units get mixed up, and we end up with like 1e9 seconds passed to increase_timeout).

oulgen (Contributor Author) replied:

We already have a test for this logic; I added it this week:

def test_asymmetric_compilation_with_fx_cache(self):

It was not being executed in fbcode due to a missing TARGETS file, as we discussed earlier; that has also been fixed.
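
As an aside, a units regression check along the lines the reviewer describes could look roughly like the sketch below. This is an illustrative, hypothetical test, not the PR's test_asymmetric_compilation_with_fx_cache; the local _timeout_increase_sec helper is a stand-in for the conversion done in this diff.

import unittest

def _timeout_increase_sec(time_saved_ns: int) -> int:
    # Hypothetical stand-in for the PR's conversion: nanoseconds in, whole seconds out.
    return int(time_saved_ns // 10**9)

class TimeoutUnitsTest(unittest.TestCase):
    def test_nanoseconds_are_converted_to_seconds(self):
        five_seconds_ns = 5 * 10**9
        # If nanoseconds were ever passed through unconverted, this would be ~5e9.
        self.assertEqual(_timeout_increase_sec(five_seconds_ns), 5)

if __name__ == "__main__":
    unittest.main()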

if not torch.distributed.is_available() or not torch.distributed.is_initialized():
    return 0

increased_timeout_sec = int(time_saved_ns // 1e9)  # convert to seconds
Contributor commented:

nit: could also do time_saved_ns // 10**9 instead...
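
For anyone reading along, the difference between the two spellings is only the result type of the floor division; a quick illustration in plain Python (not code from this PR):

time_saved_ns = 7_500_000_000      # 7.5 seconds expressed in nanoseconds

print(time_saved_ns // 1e9)        # 7.0 -> float, because 1e9 is a float literal
print(time_saved_ns // 10**9)      # 7   -> int, because both operands are ints
print(int(time_saved_ns // 1e9))   # 7   -> the explicit cast used in the diff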

def add_ephemeral_timeout_increase_for_distributed(time_saved_ns: int) -> int:
    """
    Ephemerally increases the NCCL timeout when compiling for a distributed job
    Returns amount of seconds increased
Contributor commented:

nanoseconds in, seconds out?

@oulgen (Contributor Author) commented Aug 17, 2024:

I was thinking that eventually we would use this function for all ephemeral timeouts, and we consistently use time.time_ns(); the output is only for logging.

oulgen (Contributor Author) replied:

If you think I should convert to seconds before passing to this function, I can do that, but it will result in less code sharing.
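
To illustrate the call-site convenience being described, consider the sketch below. It is illustrative only; aside from time.time_ns(), the names (do_compile, the variables) are assumptions, not code from this PR.

import time

start_ns = time.time_ns()
compiled_fn = do_compile()  # hypothetical placeholder for the compile step
time_saved_ns = time.time_ns() - start_ns

# Callers can pass the raw nanosecond measurement straight through;
# only the logged/returned value needs to be expressed in seconds.
increase_sec = add_ephemeral_timeout_increase_for_distributed(time_saved_ns)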

compiled_graph.current_callable = new_callable


def add_ephemeral_timeout_increase_for_distributed(time_saved_ns: int) -> int:
Contributor commented:

Are you sure you want seconds as an int and not a float? (this applies to all the int casting below...)

oulgen (Contributor Author) replied:

time.time_ns() returns an int, and in order to write to scuba we need them to be ints anyway
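
Putting the fragments visible in this thread together, a minimal sketch of the helper's shape might look like the following. The guard, the conversion, and the int return type are taken from the excerpts above; how the timeout increase is actually applied to the process group(s) is not shown in this thread, so that step is deliberately elided.

import torch

def add_ephemeral_timeout_increase_for_distributed(time_saved_ns: int) -> int:
    """
    Ephemerally increases the NCCL timeout when compiling for a distributed job.
    Returns the number of seconds the timeout was increased by (0 if not applicable).
    """
    if not torch.distributed.is_available() or not torch.distributed.is_initialized():
        return 0

    increased_timeout_sec = int(time_saved_ns // 1e9)  # convert to seconds
    # ... apply the increase to the active process group(s) here; the exact
    # call is not shown in the excerpts above, so it is omitted from this sketch.
    return increased_timeout_sec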

@oulgen (Contributor Author) commented Aug 17, 2024

@oulgen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@oulgen (Contributor Author) commented Aug 17, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@github-actions github-actions bot deleted the gh/oulgen/119/head branch September 28, 2024 02:09