[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training #130005
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130005

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 unrelated failure.) As of commit 10b278f with merge base 30fc4b0: FLAKY - the following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
… during training" This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful. Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3a..0e6c204 torch/_inductor/ ``` Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Roughly looking thru the PRs, I feel ``` ffc202a Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (#124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change. Since this is not a real issue, I'll raise the tolerance to make it pass. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang [ghstack-poisoned]
cc @Chillee any reason you would expect recomputation changes to have real numerics issues?
@shunting314 I wonder whether we start pattern-matching
I checked the wrapper generated with and without the remove_noop_ops call. In either case, the forward wrapper contains 12 calls to the scaled_dot_product_attention kernel, and the backward wrapper contains the corresponding backward kernel calls. Here are some pastes for reference:
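(For anyone reproducing this kind of check: a sketch of one way to get Inductor's generated wrapper code to grep through. The toy model, shapes, and dtype below are my own stand-ins, not the benchmark setup.)

```python
# Run with TORCH_LOGS="output_code" to print the generated wrapper code, or
# TORCH_COMPILE_DEBUG=1 to dump it under torch_compile_debug/, then grep for
# kernels such as scaled_dot_product_attention. Assumes a CUDA build.
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).cuda()
compiled = torch.compile(layer, backend="inductor")

x = torch.randn(2, 16, 64, device="cuda", requires_grad=True)
with torch.autocast("cuda", dtype=torch.float16):
    out = compiled(x)
out.float().sum().backward()  # also compiles and runs the backward wrapper
```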
… during training" This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful. Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3a..0e6c204 torch/_inductor/ ``` Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Roughly looking thru the PRs, I feel ``` ffc202a Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (#124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change. Since this is not a real issue, I'll raise the tolerance to make it pass. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them are: inductor / rocm6.1-py3.8-inductor / test (inductor, 1, 2, linux.rocm.gpu.2). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: inductor / rocm6.1-py3.8-inductor / test (inductor, 1, 2, linux.rocm.gpu.2), trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-13), trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14).
This reverts commit 0af8c8a. Reverted #130005 on behalf of https://github.com/jeanschmidt because it seems to have introduced breakages in the main cuda12 focal jobs ([comment](#129996 (comment))).
@shunting314 your PR has been successfully reverted.
… during training" This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful. Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3a..0e6c204 torch/_inductor/ ``` Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Roughly looking thru the PRs, I feel ``` ffc202a Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (#124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change. Since this is not a real issue, I'll raise the tolerance to make it pass. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
… during training" This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful. Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3a..0e6c204 torch/_inductor/ ``` Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Roughly looking thru the PRs, I feel ``` ffc202a Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (#124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change. Since this is not a real issue, I'll raise the tolerance to make it pass. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Try to fix #130161.

The reason `--accuracy` works is that we use eval mode, while `--training` does not work because we use training mode and TorchBench does not return targets tensors; in training mode, vision_maskrcnn requires targets tensors. I fix that by always using eval mode for vision_maskrcnn, even for the training benchmark.

With the fix, I start to see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f . I'm not sure if that's due to my local setup, but I think the fix in this PR is something we need anyway. We can check the dashboard after the PR is in.

Pull Request resolved: #130163
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005
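A rough sketch of the shape of that fix (the set and function names here are illustrative, not the actual code in benchmarks/dynamo/torchbench.py):

```python
# Hypothetical sketch: models in FORCE_EVAL_MODE run in eval mode even for the
# training benchmark, because their train() path needs inputs (targets tensors
# for vision_maskrcnn) that the TorchBench setup does not provide.
FORCE_EVAL_MODE = {"vision_maskrcnn"}

def prepare_model(name: str, model, training: bool):
    if training and name not in FORCE_EVAL_MODE:
        model.train()
    else:
        model.eval()
    return model
```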
The training accuracy for this model has started to regress. It does not show up on the weekly run yet, but (1) it shows up in my max-autotune (MA) runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c) and (2) I can repro it locally.

Command:

```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend inductor --device cuda --only squeezenet1_1
```

Raise the tolerance to fix.

Pull Request resolved: #130165
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005, #130163
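For context, the accuracy mode boils down to comparing eager and compiled outputs under a tolerance, so raising the per-model tolerance simply loosens that comparison. A minimal sketch of the idea (not the actual harness, which has its own comparison helper and handles structured outputs):

```python
# Compare eager vs. Inductor-compiled outputs with per-model rtol/atol;
# raising the tolerance for a model corresponds to passing larger values here.
import torch

def check_accuracy(model, example_inputs, rtol=1e-3, atol=1e-3):
    expected = model(*example_inputs)
    compiled_model = torch.compile(model, backend="inductor")
    actual = compiled_model(*example_inputs)
    return torch.allclose(actual, expected, rtol=rtol, atol=atol)
```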
X-link: pytorch/pytorch#130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
Reviewed By: kit1980
Differential Revision: D59413523
Pulled By: shunting314
fbshipit-source-id: d4d678b000bf497d1f48a3c74032bbd4d08aa5ac