Skip to content

Conversation

jamesjwu
Copy link
Contributor

@jamesjwu jamesjwu commented Mar 13, 2024

[ghstack-poisoned]
Copy link

pytorch-bot bot commented Mar 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121827

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ef89b93 with merge base failed to retrieve merge base, please contact dev infra (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jamesjwu jamesjwu changed the title Log graph break reasons and wasted compile time in compilationmetrics Log graph break reasons and wasted compile time in CompilationMetrics Mar 13, 2024
transform: Callable[[List[Instruction], Dict[str, Any]], Any],
) -> Optional[GuardedCode]:
nonlocal output
nonlocal wasted_compile_time
Copy link
Contributor Author

@jamesjwu jamesjwu Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: "wasted" might not be the right word to describe the time spent on an attempt that triggered a restart. Maybe "extra_compile_time"? Open to suggestions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think restart_count would be a better metrics, as one restart take 100 sec is better than 10 restarts takes 100 sec. We should figure out why it restarts so many times. If only one restart but take too long time, this means the dynamo tracing of this frame is pretty slow, we can get the signal from entire_frame_compile_time - backend_compile_time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like "wasted" either. Maybe dynamo_time_before_restart

@jamesjwu jamesjwu requested a review from yanboliang March 13, 2024 17:05
@jamesjwu jamesjwu marked this pull request as ready for review March 13, 2024 17:05
@jamesjwu jamesjwu requested a review from oulgen March 13, 2024 17:05
@jamesjwu jamesjwu added the topic: not user facing topic category label Mar 13, 2024
@jamesjwu jamesjwu requested a review from ezyang March 13, 2024 17:06
[ghstack-poisoned]
[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 13, 2024
graph_break_reason = output.compile_subgraph_reason.reason
else:
graph_break_reason = None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a strong reason to have a separate graph_break_reason, can we just use the fail_reason? I think if this is a graph break, it should be the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fail_reason is used when the entire frame falls back to eager, which to me seems worse than restarting the compilation, successfully compiling but just producing two graphs.

Happy to just use fail_reason for both for measuring graph breaks, but would it ever be confusing to distinguish between failing the entire frame (and, in cases when suppress_errors=False, hard failing) vs. inserting a graph break?

You can still tell by whether or not fail_type or entire_frame_compile_time exists, though, so I suppose you don't necessarily need it?

transform: Callable[[List[Instruction], Dict[str, Any]], Any],
) -> Optional[GuardedCode]:
nonlocal output
nonlocal wasted_compile_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think restart_count would be a better metrics, as one restart take 100 sec is better than 10 restarts takes 100 sec. We should figure out why it restarts so many times. If only one restart but take too long time, this means the dynamo tracing of this frame is pretty slow, we can get the signal from entire_frame_compile_time - backend_compile_time.

@jamesjwu
Copy link
Contributor Author

jamesjwu commented Mar 14, 2024

I do think wasted tracing time is actually a good thing to measure because it can be aggregated as a % of total tracing time that we've spent: it's true that code that takes longer to trace affects the metric more, but that's even more reason to measure and minimize restarts on code that takes longer to trace.

Correct me if I'm wrong, but I think there can really only be at most one restart due to a graph break per frame. Whenever we graph break, because we restart on the first failed branch and make a new frame/compilation metric after the break. So we can already measure restart_count by the count of compilation events where there exists a graph_break_reason (or, if we just use fail_reason, the presence of a fail_reason and a successful compile).

@jamesjwu jamesjwu requested a review from yanboliang March 14, 2024 17:38
Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems fine, just bikeshedding.

transform: Callable[[List[Instruction], Dict[str, Any]], Any],
) -> Optional[GuardedCode]:
nonlocal output
nonlocal wasted_compile_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like "wasted" either. Maybe dynamo_time_before_restart

"Restarting analysis due to %s",
LazyString(format_traceback_short, e.__traceback__),
)
wasted_compile_time += time.time() - attempt_start_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more robust to just compute only three timestamps: start, end, and start of the last restart before we succeeded.

[ghstack-poisoned]
@jamesjwu
Copy link
Contributor Author

@yanboliang and I chatted offline and he suggested recording all of the reasons for a restart in case there are multiple restart reasons, instead of just the last graph break reason, so updated the PR to reflect that. In the case of graph breaks, we expect the length of restart reasons to be at most 1, but it can't really hurt to log cases where we restart lots of times

@jamesjwu jamesjwu changed the title Log graph break reasons and wasted compile time in CompilationMetrics Log restart reasons and extra compile time in CompilationMetrics Mar 15, 2024
[ghstack-poisoned]
[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Mar 15, 2024
@jamesjwu
Copy link
Contributor Author

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 16, 2024
@pytorchmergebot
Copy link
Collaborator

@jamesjwu
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Copy link
Collaborator

Successfully rebased gh/jamesjwu/17/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/121827)

pytorchmergebot pushed a commit that referenced this pull request Mar 18, 2024
@jamesjwu
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Mar 19, 2024
…1827)

Summary:
X-link: pytorch/pytorch#121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang

Reviewed By: huydhn

Differential Revision: D55046110

Pulled By: jamesjwu

fbshipit-source-id: aadb102f6a0a8626c0447209c9288cdf9307a152
@github-actions github-actions bot deleted the gh/jamesjwu/17/head branch April 18, 2024 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants