Log restart reasons and extra compile time in CompilationMetrics #121827
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121827
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit ef89b93 (failed to retrieve merge base; please contact dev infra).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_dynamo/convert_frame.py (Outdated)
    transform: Callable[[List[Instruction], Dict[str, Any]], Any],
) -> Optional[GuardedCode]:
    nonlocal output
    nonlocal wasted_compile_time
TODO: "wasted" might not be the right word to describe the time spent on an attempt that triggered a restart. Maybe "extra_compile_time"? Open to suggestions
I think restart_count would be a better metric, since one restart taking 100 sec is better than 10 restarts taking 100 sec: we should figure out why a frame restarts so many times. If there is only one restart but it takes too long, that means dynamo tracing of this frame is quite slow, and we can already get that signal from entire_frame_compile_time - backend_compile_time.
I don't like "wasted" either. Maybe dynamo_time_before_restart?
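To make the naming discussion concrete, here is a minimal, self-contained sketch (not the actual convert_frame.py implementation; RestartAnalysis and the attempt loop are simplified stand-ins) of how a dynamo_time_before_restart-style metric and a restart_count could both be accumulated per frame:

import time
from typing import Callable, Optional, Tuple

class RestartAnalysis(Exception):
    """Stand-in for Dynamo's restart-analysis exception."""

def compile_with_restarts(
    trace_attempt: Callable[[int], str], max_attempts: int = 5
) -> Tuple[Optional[str], float, int]:
    """Run trace_attempt until it succeeds, tracking time lost to restarts.

    Returns (result, dynamo_time_before_restart, restart_count).
    """
    dynamo_time_before_restart = 0.0
    restart_count = 0
    for attempt in range(max_attempts):
        attempt_start = time.time()
        try:
            return trace_attempt(attempt), dynamo_time_before_restart, restart_count
        except RestartAnalysis:
            # Only attempts that did not produce the final code count as
            # time spent before a restart.
            dynamo_time_before_restart += time.time() - attempt_start
            restart_count += 1
    return None, dynamo_time_before_restart, restart_count

def flaky_trace(attempt: int) -> str:
    time.sleep(0.01)
    if attempt == 0:
        raise RestartAnalysis("graph break on the first failed branch")
    return "compiled code"

result, lost, restarts = compile_with_restarts(flaky_trace)
print(result, f"{lost:.3f}s lost across {restarts} restart(s)")

Logging both values would cover the restart_count suggestion as well as the time-based aggregation discussed further down.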
torch/_dynamo/convert_frame.py (Outdated)
    graph_break_reason = output.compile_subgraph_reason.reason
else:
    graph_break_reason = None
I don't see a strong reason to have a separate graph_break_reason; can we just use fail_reason? If this is a graph break, I think it should be the same.
fail_reason is used when the entire frame falls back to eager, which to me seems worse than restarting the compilation and compiling successfully, just producing two graphs.
Happy to use fail_reason for both when measuring graph breaks, but would it ever be useful to distinguish failing the entire frame (and, when suppress_errors=False, hard failing) from inserting a graph break? You can still tell by whether fail_type or entire_frame_compile_time exists, though, so I suppose you don't necessarily need it.
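As a rough illustration of the fail_type / entire_frame_compile_time distinction described above (field names follow this discussion and are simplified stand-ins, not the exact CompilationMetrics schema), a consumer of the metrics could classify records like this:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Metrics:  # simplified stand-in for the real CompilationMetrics record
    fail_type: Optional[str]
    fail_reason: Optional[str]
    entire_frame_compile_time_s: Optional[float]
    restart_reasons: List[str]

def classify(m: Metrics) -> str:
    if m.fail_type is not None or m.entire_frame_compile_time_s is None:
        # Whole frame fell back to eager (or hard-failed when suppress_errors=False).
        return f"frame failed: {m.fail_reason}"
    if m.restart_reasons:
        # Compilation succeeded, but tracing restarted at least once (e.g. on a graph break).
        return f"compiled with restarts: {m.restart_reasons}"
    return "compiled cleanly"

print(classify(Metrics("Unsupported", "data-dependent branch", None, [])))
print(classify(Metrics(None, None, 1.2, ["graph break: builtin print"])))
print(classify(Metrics(None, None, 0.8, [])))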
Regarding restart_count: I do think wasted tracing time is a good thing to measure, because it can be aggregated as a percentage of total tracing time. It's true that code which takes longer to trace affects the metric more, but that is all the more reason to measure and minimize restarts on code that takes longer to trace. Correct me if I'm wrong, but I think there can be at most one restart due to a graph break per frame: whenever we graph break, we restart on the first failed branch and create a new frame/compilation metric after the break. So we can already measure…
This all seems fine, just bikeshedding.
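A small sketch of the aggregation argument above, assuming hypothetical per-frame records with entire_frame_compile_time and dynamo_time_before_restart fields (names follow this thread, not necessarily the final schema):

from typing import Iterable, Mapping

def wasted_tracing_pct(frames: Iterable[Mapping[str, float]]) -> float:
    """Percentage of total Dynamo tracing time spent on attempts that restarted."""
    frames = list(frames)
    total = sum(f["entire_frame_compile_time"] for f in frames)
    wasted = sum(f["dynamo_time_before_restart"] for f in frames)
    return 100.0 * wasted / total if total else 0.0

frames = [
    {"entire_frame_compile_time": 4.0, "dynamo_time_before_restart": 1.0},
    {"entire_frame_compile_time": 0.5, "dynamo_time_before_restart": 0.0},
]
print(f"{wasted_tracing_pct(frames):.1f}% of tracing time preceded a restart")

Because the metric is time-weighted, a restart on a slow-to-trace frame moves the aggregate more than one on a cheap frame, which is exactly the signal being argued for.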
torch/_dynamo/convert_frame.py (Outdated)
    "Restarting analysis due to %s",
    LazyString(format_traceback_short, e.__traceback__),
)
wasted_compile_time += time.time() - attempt_start_time
It might be more robust to just compute three timestamps: start, end, and the start of the last restart before we succeeded.
@yanboliang and I chatted offline, and he suggested recording all of the reasons for a restart, in case there are multiple, instead of just the last graph break reason, so I updated the PR to reflect that. In the case of graph breaks we expect the list of restart reasons to have length at most 1, but it can't hurt to log cases where we restart many times.
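A minimal sketch of that logging shape (dataclass and helper names here are illustrative, not the exact CompilationMetrics fields): restart reasons are collected into a list rather than keeping only the last one, and the time spent before each restart is accumulated, so a frame that restarts several times stays fully visible in the metrics.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameMetrics:  # illustrative stand-in for the per-frame metrics record
    restart_reasons: List[str] = field(default_factory=list)
    dynamo_time_before_restart_s: float = 0.0

def record_restart(metrics: FrameMetrics, reason: str, attempt_start: float, now: float) -> None:
    # Called whenever tracing of a frame restarts: keep every reason rather
    # than only the last one, and accumulate the time spent on the attempt.
    metrics.restart_reasons.append(reason)
    metrics.dynamo_time_before_restart_s += now - attempt_start

m = FrameMetrics()
record_restart(m, "graph break: data-dependent control flow", attempt_start=10.0, now=12.5)
print(m)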
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 5 checks: inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamo_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor-periodic / cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test (dynamic_aot_eager_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki.
Merge failed. Reason: 2 jobs have failed; the first few are: .github/workflows/inductor.yml / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), .github/workflows/inductor.yml / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu).
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.
…1827)
Summary: X-link: pytorch/pytorch#121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang
Reviewed By: huydhn
Differential Revision: D55046110
Pulled By: jamesjwu
fbshipit-source-id: aadb102f6a0a8626c0447209c9288cdf9307a152
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang