Conversation

@desertfire (Contributor) commented Dec 21, 2022

Stack from ghstack (oldest at bottom):

Summary:

1. Setting torch.backends.cudnn.deterministic to True helps eliminate the eager_variance failures seen on CI (see the sketch below).
2. Skip the known Triton failure instead of retrying.
3. Some minor script cleanup is also included in this PR.
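As a hedged illustration of item 1: a minimal sketch of the determinism setting involved, assuming a CUDA workload. The eager_variance check is a hypothetical reconstruction of what the benchmark harness measures, not code from this PR.

```python
import torch
import torch.nn as nn

# From the PR: force cuDNN to choose deterministic kernels so that repeated
# eager runs of the same model agree bitwise.
torch.backends.cudnn.deterministic = True
# Assumption: autotuning is disabled as well, since cudnn.benchmark can pick
# different algorithms across runs; the PR text only mentions .deterministic.
torch.backends.cudnn.benchmark = False


def eager_variance(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Hypothetical eager-vs-eager variance check: with deterministic cuDNN
    # kernels, two identical eager runs should now match exactly.
    torch.manual_seed(0)
    out1 = model(x)
    torch.manual_seed(0)
    out2 = model(x)
    return (out1 - out2).abs().max()
```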

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx

@pytorch-bot (bot) commented Dec 21, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91283

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 540110a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@desertfire changed the title from "[WIP] investigate eager_variation" to "[inductor] Set use_eval_mode for timm_model" on Dec 22, 2022
Summary: We need to set use_eval_mode when checking training accuracy,
not only for inductor but also for eager, to avoid randomness from
dropout. Some minor script cleanup is also included in this PR.

[ghstack-poisoned]
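
As a hedged aside on why use_eval_mode removes the randomness: .eval() turns nn.Dropout layers into no-ops, so the same input produces the same output on every call. A minimal sketch with a hypothetical toy model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy model; timm models contain real dropout/BatchNorm layers.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.randn(2, 8)

model.train()
# In train mode each forward pass samples a fresh dropout mask,
# so two calls on the same input (almost surely) disagree.
assert not torch.equal(model(x), model(x))

model.eval()
# In eval mode dropout is the identity, so the output is reproducible.
assert torch.equal(model(x), model(x))
```
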
@ngimel (Collaborator) commented Dec 22, 2022

cc @anijain2305. We can't set eval mode for timm models because that would disable bn testing, and we had serious bugs when we didn't test bn.
Instead, we should leave the mode as training, but set dropout to 0 to avoid non-determinism (also, my impression was that dropout is already 0 for most timm models by default).
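A minimal sketch of this alternative: keep the model in train mode so BatchNorm is still exercised, but zero out every dropout probability. The helper name and module walk are assumptions, not code from the PR:

```python
import torch.nn as nn

def zero_dropout(model: nn.Module) -> None:
    # Hypothetical helper: keep the model in train() so BatchNorm still
    # updates running stats (and stays under test), but make every dropout
    # layer a no-op by setting its probability to 0.
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.p = 0.0

model = nn.Sequential(nn.BatchNorm1d(8), nn.Dropout(p=0.1))
model.train()        # BatchNorm is still exercised in training mode
zero_dropout(model)  # dropout no longer injects randomness
```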

@desertfire changed the title from "[inductor] Set use_eval_mode for timm_model" to "[inductor] Set torch.backends.cudnn.deterministic" on Dec 22, 2022
Summary: Setting torch.backends.cudnn.deterministic to True helps
eliminate the eager_variance failures seen on CI.
Some minor script cleanup is also included in this PR.

[ghstack-poisoned]
@desertfire (Contributor, Author) commented

> cc @anijain2305. We can't set eval mode for timm models because that would disable bn testing, and we had serious bugs when we didn't test bn. Instead, we should leave the mode as training, but set dropout to 0 to avoid non-determinism (also, my impression was that dropout is already 0 for most timm models by default).

Discussed offline. Dropout is not the root cause here since it is already 0 for those failing models we have seen on CI. Setting torch.backends.cudnn.deterministic = True works, at least for the models I tried. Let's see if it solves all the eager_variance failures on CI.

@anijain2305 (Contributor) commented

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Dec 22, 2022
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 2 additional jobs have failed, first few of them are: inductor, inductor / cuda11.6-py3.10-gcc7-sm86 / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job.

@desertfire changed the title from "[inductor] Set torch.backends.cudnn.deterministic" to "[inductor] CI improvements" on Dec 22, 2022
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps eliminate the eager_variance failures seen on CI.
2) Skip the known Triton failure instead of retrying.
3) Some minor script cleanup is also included in this PR.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Dec 22, 2022
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps eliminate the eager_variance failures seen on CI.
2) Skip the known Triton failure instead of retrying.
3) Some minor script cleanup is also included in this PR.

ghstack-source-id: 76e04e4
Pull Request resolved: #91283
@desertfire (Contributor, Author) commented

@pytorchbot merge -f "Only affects inductor shards and they have already passed"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

desertfire added a commit that referenced this pull request Jan 3, 2023
Summary: #91283 skips a certain random Triton failure on CI, but we need
to check against the BackendCompilerFailed exception type.

[ghstack-poisoned]
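
A hedged sketch of what this follow-up corrects: Dynamo surfaces the flaky Triton error wrapped in torch._dynamo.exc.BackendCompilerFailed, so that is the type the skip logic must catch. The helper name, message heuristic, and skip mechanism below are assumptions, not the harness's actual code.

```python
import unittest

import torch
from torch._dynamo.exc import BackendCompilerFailed

def run_compiled(model, example_input):
    try:
        return torch.compile(model)(example_input)
    except BackendCompilerFailed as e:
        # Dynamo wraps backend (inductor/Triton) compile errors in
        # BackendCompilerFailed, so matching on the raw Triton exception type
        # never fires. Skip the known-flaky failure instead of retrying it.
        if "triton" in str(e).lower():  # hypothetical matching heuristic
            raise unittest.SkipTest("known flaky Triton compile failure") from e
        raise
```
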
desertfire added a commit that referenced this pull request Jan 3, 2023
Summary: #91283 skips a certain random Triton failure on CI, but we need
to check against the BackendCompilerFailed exception type.

ghstack-source-id: 0db96f8
Pull Request resolved: #91634
pytorchmergebot pushed a commit that referenced this pull request Jan 3, 2023
Summary: #91283 skips a certain random Triton failure on CI, but we need
to check against the BackendCompilerFailed exception type.

Pull Request resolved: #91634
Approved by: https://github.com/ngimel
@facebook-github-bot deleted the gh/desertfire/56/head branch on June 8, 2023 16:11