Log restart reasons and extra compile time in CompilationMetrics #121827
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121827
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit ef89b93 (failed to retrieve merge base; please contact dev infra).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_dynamo/convert_frame.py (Outdated)
    transform: Callable[[List[Instruction], Dict[str, Any]], Any],
) -> Optional[GuardedCode]:
    nonlocal output
    nonlocal wasted_compile_time
TODO: "wasted" might not be the right word to describe the time spent on an attempt that triggered a restart. Maybe "extra_compile_time"? Open to suggestions
I think restart_count would be a better metric, since one restart taking 100 sec is better than 10 restarts taking 100 sec: we should figure out why a frame restarts so many times. If there is only one restart but it takes too long, that means dynamo tracing of this frame is quite slow, and we can already get that signal from entire_frame_compile_time - backend_compile_time.
I don't like "wasted" either. Maybe dynamo_time_before_restart?
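To make the naming discussion concrete, here is a minimal, self-contained sketch (not the actual convert_frame.py implementation; RestartAnalysis and the attempt loop are simplified stand-ins) of how a dynamo_time_before_restart-style metric and a restart_count could both be accumulated per frame:

import time
from typing import Callable, Optional, Tuple

class RestartAnalysis(Exception):
    """Stand-in for Dynamo's restart-analysis exception."""

def compile_with_restarts(
    trace_attempt: Callable[[int], str], max_attempts: int = 5
) -> Tuple[Optional[str], float, int]:
    """Run trace_attempt until it succeeds, tracking time lost to restarts.

    Returns (result, dynamo_time_before_restart, restart_count).
    """
    dynamo_time_before_restart = 0.0
    restart_count = 0
    for attempt in range(max_attempts):
        attempt_start = time.time()
        try:
            return trace_attempt(attempt), dynamo_time_before_restart, restart_count
        except RestartAnalysis:
            # Only attempts that did not produce the final code count as
            # time spent before a restart.
            dynamo_time_before_restart += time.time() - attempt_start
            restart_count += 1
    return None, dynamo_time_before_restart, restart_count

def flaky_trace(attempt: int) -> str:
    time.sleep(0.01)
    if attempt == 0:
        raise RestartAnalysis("graph break on the first failed branch")
    return "compiled code"

result, lost, restarts = compile_with_restarts(flaky_trace)
print(result, f"{lost:.3f}s lost across {restarts} restart(s)")

Logging both values would cover the restart_count suggestion as well as the time-based aggregation discussed further down.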
torch/_dynamo/convert_frame.py (Outdated)
    graph_break_reason = output.compile_subgraph_reason.reason
else:
    graph_break_reason = None
I don't see a strong reason to have a separate graph_break_reason; can we just use fail_reason? If this is a graph break, I think it should be the same.
fail_reason is used when the entire frame falls back to eager, which to me seems worse than restarting the compilation and compiling successfully, just producing two graphs.
Happy to use fail_reason for both when measuring graph breaks, but would it ever be useful to distinguish failing the entire frame (and, when suppress_errors=False, hard failing) from inserting a graph break? You can still tell by whether fail_type or entire_frame_compile_time exists, though, so I suppose you don't necessarily need it.
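As a rough illustration of the fail_type / entire_frame_compile_time distinction described above (field names follow this discussion and are simplified stand-ins, not the exact CompilationMetrics schema), a consumer of the metrics could classify records like this:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Metrics:  # simplified stand-in for the real CompilationMetrics record
    fail_type: Optional[str]
    fail_reason: Optional[str]
    entire_frame_compile_time_s: Optional[float]
    restart_reasons: List[str]

def classify(m: Metrics) -> str:
    if m.fail_type is not None or m.entire_frame_compile_time_s is None:
        # Whole frame fell back to eager (or hard-failed when suppress_errors=False).
        return f"frame failed: {m.fail_reason}"
    if m.restart_reasons:
        # Compilation succeeded, but tracing restarted at least once (e.g. on a graph break).
        return f"compiled with restarts: {m.restart_reasons}"
    return "compiled cleanly"

print(classify(Metrics("Unsupported", "data-dependent branch", None, [])))
print(classify(Metrics(None, None, 1.2, ["graph break: builtin print"])))
print(classify(Metrics(None, None, 0.8, [])))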
Regarding restart_count: I do think wasted tracing time is a good thing to measure, because it can be aggregated as a percentage of total tracing time. It's true that code which takes longer to trace affects the metric more, but that is all the more reason to measure and minimize restarts on code that takes longer to trace. Correct me if I'm wrong, but I think there can be at most one restart due to a graph break per frame: whenever we graph break, we restart on the first failed branch and create a new frame/compilation metric after the break. So we can already measure…
This all seems fine, just bikeshedding.
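A small sketch of the aggregation argument above, assuming hypothetical per-frame records with entire_frame_compile_time and dynamo_time_before_restart fields (names follow this thread, not necessarily the final schema):

from typing import Iterable, Mapping

def wasted_tracing_pct(frames: Iterable[Mapping[str, float]]) -> float:
    """Percentage of total Dynamo tracing time spent on attempts that restarted."""
    frames = list(frames)
    total = sum(f["entire_frame_compile_time"] for f in frames)
    wasted = sum(f["dynamo_time_before_restart"] for f in frames)
    return 100.0 * wasted / total if total else 0.0

frames = [
    {"entire_frame_compile_time": 4.0, "dynamo_time_before_restart": 1.0},
    {"entire_frame_compile_time": 0.5, "dynamo_time_before_restart": 0.0},
]
print(f"{wasted_tracing_pct(frames):.1f}% of tracing time preceded a restart")

Because the metric is time-weighted, a restart on a slow-to-trace frame moves the aggregate more than one on a cheap frame, which is exactly the signal being argued for.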
torch/_dynamo/convert_frame.py (Outdated)
    "Restarting analysis due to %s",
    LazyString(format_traceback_short, e.__traceback__),
)
wasted_compile_time += time.time() - attempt_start_time
It might be more robust to just compute three timestamps: start, end, and the start of the last restart before we succeeded.
@yanboliang and I chatted offline, and he suggested recording all of the reasons for a restart, in case there are multiple, instead of just the last graph break reason, so I updated the PR to reflect that. In the case of graph breaks we expect the list of restart reasons to have length at most 1, but it can't hurt to log cases where we restart many times.
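A minimal sketch of that logging shape (dataclass and helper names here are illustrative, not the exact CompilationMetrics fields): restart reasons are collected into a list rather than keeping only the last one, and the time spent before each restart is accumulated, so a frame that restarts several times stays fully visible in the metrics.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameMetrics:  # illustrative stand-in for the per-frame metrics record
    restart_reasons: List[str] = field(default_factory=list)
    dynamo_time_before_restart_s: float = 0.0

def record_restart(metrics: FrameMetrics, reason: str, attempt_start: float, now: float) -> None:
    # Called whenever tracing of a frame restarts: keep every reason rather
    # than only the last one, and accumulate the time spent on the attempt.
    metrics.restart_reasons.append(reason)
    metrics.dynamo_time_before_restart_s += now - attempt_start

m = FrameMetrics()
record_restart(m, "graph break: data-dependent control flow", attempt_start=10.0, now=12.5)
print(m)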
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 5 checks: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki.
Merge failed. Reason: 2 jobs have failed; the first few are: .github/workflows/inductor.yml / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), .github/workflows/inductor.yml / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu).
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.
…1827)
Summary: X-link: pytorch/pytorch#121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang
Reviewed By: huydhn
Differential Revision: D55046110
Pulled By: jamesjwu
fbshipit-source-id: aadb102f6a0a8626c0447209c9288cdf9307a152
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang