Conversation

@desertfire (Contributor) commented Dec 21, 2022

Stack from ghstack (oldest at bottom):

Summary:

1. Setting torch.backends.cudnn.deterministic to True helps eliminate the eager_variance failures seen on CI (see the sketch below).
2. Skip the known Triton failure instead of retrying.
3. Some minor script cleanup is also included in this PR.
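As a hedged illustration of item 1: a minimal sketch of the determinism setting involved, assuming a CUDA workload. The eager_variance check is a hypothetical reconstruction of what the benchmark harness measures, not code from this PR.

```python
import torch
import torch.nn as nn

# From the PR: force cuDNN to choose deterministic kernels so that repeated
# eager runs of the same model agree bitwise.
torch.backends.cudnn.deterministic = True
# Assumption: autotuning is disabled as well, since cudnn.benchmark can pick
# different algorithms across runs; the PR text only mentions .deterministic.
torch.backends.cudnn.benchmark = False


def eager_variance(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Hypothetical eager-vs-eager variance check: with deterministic cuDNN
    # kernels, two identical eager runs should now match exactly.
    torch.manual_seed(0)
    out1 = model(x)
    torch.manual_seed(0)
    out2 = model(x)
    return (out1 - out2).abs().max()
```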

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx

@pytorch-bot (bot) commented Dec 21, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91283

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 540110a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@desertfire changed the title from "[WIP] investigate eager_variation" to "[inductor] Set use_eval_mode for timm_model" on Dec 22, 2022
Summary: We need to set use_eval_mode when checking training accuracy,
not only for inductor but also for eager, to avoid randomness from
dropout. Some minor script cleanup is also included in this PR.

[ghstack-poisoned]
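
As a hedged aside on why use_eval_mode removes the randomness: .eval() turns nn.Dropout layers into no-ops, so the same input produces the same output on every call. A minimal sketch with a hypothetical toy model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy model; timm models contain real dropout/BatchNorm layers.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.randn(2, 8)

model.train()
# In train mode each forward pass samples a fresh dropout mask,
# so two calls on the same input (almost surely) disagree.
assert not torch.equal(model(x), model(x))

model.eval()
# In eval mode dropout is the identity, so the output is reproducible.
assert torch.equal(model(x), model(x))
```
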
@ngimel (Collaborator) commented Dec 22, 2022

cc @anijain2305. We can't set eval mode for timm models because that would disable bn testing, and we had serious bugs when we didn't test bn.
Instead, we should leave the mode as training, but set dropout to 0 to avoid non-determinism (also, my impression was that dropout is already 0 for most timm models by default).
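A minimal sketch of this alternative: keep the model in train mode so BatchNorm is still exercised, but zero out every dropout probability. The helper name and module walk are assumptions, not code from the PR:

```python
import torch.nn as nn

def zero_dropout(model: nn.Module) -> None:
    # Hypothetical helper: keep the model in train() so BatchNorm still
    # updates running stats (and stays under test), but make every dropout
    # layer a no-op by setting its probability to 0.
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.p = 0.0

model = nn.Sequential(nn.BatchNorm1d(8), nn.Dropout(p=0.1))
model.train()        # BatchNorm is still exercised in training mode
zero_dropout(model)  # dropout no longer injects randomness
```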

@desertfire changed the title from "[inductor] Set use_eval_mode for timm_model" to "[inductor] Set torch.backends.cudnn.deterministic" on Dec 22, 2022
Summary: Setting torch.backends.cudnn.deterministic to True helps
eliminate the eager_variance failures seen on CI.
Some minor script cleanup is also included in this PR.

[ghstack-poisoned]
@desertfire (Contributor, Author) commented

> cc @anijain2305. We can't set eval mode for timm models because that would disable bn testing, and we had serious bugs when we didn't test bn. Instead, we should leave the mode as training, but set dropout to 0 to avoid non-determinism (also, my impression was that dropout is already 0 for most timm models by default).

Discussed offline. Dropout is not the root cause here since it is already 0 for those failing models we have seen on CI. Setting torch.backends.cudnn.deterministic = True works, at least for the models I tried. Let's see if it solves all the eager_variance failures on CI.

@anijain2305 (Contributor) commented

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Dec 22, 2022
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 2 additional jobs have failed, first few of them are: inductor, inductor / cuda11.6-py3.10-gcc7-sm86 / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job.

@desertfire changed the title from "[inductor] Set torch.backends.cudnn.deterministic" to "[inductor] CI improvements" on Dec 22, 2022
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps eliminate the eager_variance failures seen on CI.
2) Skip the known Triton failure instead of retrying.
3) Some minor script cleanup is also included in this PR.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Dec 22, 2022
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps eliminate the eager_variance failures seen on CI.
2) Skip the known Triton failure instead of retrying.
3) Some minor script cleanup is also included in this PR.

ghstack-source-id: 76e04e4
Pull Request resolved: #91283
@desertfire (Contributor, Author) commented

@pytorchbot merge -f "Only affects inductor shards and they have already passed"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

desertfire added a commit that referenced this pull request Jan 3, 2023
Summary: #91283 skips a certain random Triton failure on CI, but we need
to check against the BackendCompilerFailed exception type.

[ghstack-poisoned]
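
A hedged sketch of what this follow-up corrects: Dynamo surfaces the flaky Triton error wrapped in torch._dynamo.exc.BackendCompilerFailed, so that is the type the skip logic must catch. The helper name, message heuristic, and skip mechanism below are assumptions, not the harness's actual code.

```python
import unittest

import torch
from torch._dynamo.exc import BackendCompilerFailed

def run_compiled(model, example_input):
    try:
        return torch.compile(model)(example_input)
    except BackendCompilerFailed as e:
        # Dynamo wraps backend (inductor/Triton) compile errors in
        # BackendCompilerFailed, so matching on the raw Triton exception type
        # never fires. Skip the known-flaky failure instead of retrying it.
        if "triton" in str(e).lower():  # hypothetical matching heuristic
            raise unittest.SkipTest("known flaky Triton compile failure") from e
        raise
```
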
desertfire added a commit that referenced this pull request Jan 3, 2023
Summary: #91283 skips a certain random Triton failure on CI, but we need
to check against the BackendCompilerFailed exception type.

ghstack-source-id: 0db96f8
Pull Request resolved: #91634
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2023
Summary: #91283 skips a certain random Triton failure on CI, but we need
to check against the BackendCompilerFailed exception type.

Pull Request resolved: #91634
Approved by: https://github.com/ngimel
@facebook-github-bot deleted the gh/desertfire/56/head branch on June 8, 2023 16:11