ngimel
Collaborator

@ngimel ngimel commented Jun 8, 2021

Fixes #59584
@albanD, @soulitzer, renorm grad was completely busted. Fast gradcheck is definitely not doing its job.

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 8, 2021

💊 CI failures summary and remediations

As of commit 0192aa2 (more details on the Dr. CI page):


  • 3/3 failures introduced in this PR

3 failures not recognized by patterns:

Job | Step
CircleCI pytorch_linux_bionic_py3_6_clang9_noarch_test | Report results
CircleCI pytorch_linux_xenial_py3_6_gcc5_4_test | Report results
CircleCI pytorch_macos_10_13_py3_test | Report results

3 jobs timed out:

  • pytorch_linux_bionic_py3_6_clang9_noarch_test
  • pytorch_linux_xenial_py3_6_gcc5_4_test
  • pytorch_macos_10_13_py3_test

This comment was automatically generated by Dr. CI.

@ngimel ngimel requested a review from peterbell10 June 8, 2021 05:35
Collaborator

@albanD albanD left a comment

Thanks for the update!

It is known that fast gradcheck is less precise and can hide failures in some cases, which is why we are keeping the periodic slow gradcheck build. Note that this is the first time since it was introduced that it has actually hidden a failure, so it's not too common.
You can also set gradcheck_fast_mode=False on the OpInfo if you want to force it to run with slow gradcheck.
I am also working on adding a label to be able to test with slow gradcheck on PRs to make it easier to run it when we have doubts.

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator Author

ngimel commented Jun 8, 2021

"Can hide failure in some cases" and "doesn't flag completely wrong gradient computation" are different failure modes. This is the latter. Jacobians mostly consist of 0's, so if we are checking just this fact (not even positions of 0's), that's not very informative.
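The difference between comparing a full Jacobian and comparing a single projection of it can be sketched in plain Python. This is a toy illustration only, not PyTorch's gradcheck code: `renorm_row`, `analytic_jacobian`, and the other helper names are hypothetical, the function handles one row with a 2-norm and no epsilon term, and the "projected" check is assumed to compare the scalar u^T J v for random vectors u and v.

```python
import math
import random

MAXNORM = 1.0  # illustrative value, not taken from the PR


def renorm_row(x, maxnorm=MAXNORM):
    """Scale x so its 2-norm is at most maxnorm (single-row toy version)."""
    norm = math.sqrt(sum(v * v for v in x))
    scale = maxnorm / norm if norm > maxnorm else 1.0
    return [scale * v for v in x]


def analytic_jacobian(x, maxnorm=MAXNORM):
    """Hand-derived Jacobian J[i][j] = d out_i / d x_j of renorm_row."""
    n = len(x)
    norm = math.sqrt(sum(v * v for v in x))
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    if norm <= maxnorm:
        return eye  # identity: the row is passed through unchanged
    return [[maxnorm * (eye[i][j] / norm - x[i] * x[j] / norm ** 3)
             for j in range(n)] for i in range(n)]


def numeric_jacobian(f, x, h=1e-6):
    """Full check: central differences, one input coordinate at a time."""
    n = len(x)
    cols = []
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        cols.append([(a - b) / (2 * h) for a, b in zip(f(xp), f(xm))])
    # cols[j][i] holds d out_i / d x_j; transpose into J[i][j].
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def projected_numeric(f, x, u, v, h=1e-6):
    """Projected check: one directional difference along v, dotted with u."""
    xp = [a + h * b for a, b in zip(x, v)]
    xm = [a - h * b for a, b in zip(x, v)]
    jv = [(a - b) / (2 * h) for a, b in zip(f(xp), f(xm))]
    return sum(a * b for a, b in zip(u, jv))


def projected_analytic(J, u, v):
    """The same scalar u^T J v computed from the analytic Jacobian."""
    return sum(u[i] * J[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


x = [0.9, -1.2, 0.4]  # norm is about 1.55, so the scaling branch is active
rng = random.Random(0)
u = [rng.gauss(0, 1) for _ in x]
v = [rng.gauss(0, 1) for _ in x]

J = analytic_jacobian(x)
J_num = numeric_jacobian(renorm_row, x)
full_err = max(abs(J[i][j] - J_num[i][j])
               for i in range(len(x)) for j in range(len(x)))
proj_err = abs(projected_analytic(J, u, v)
               - projected_numeric(renorm_row, x, u, v))
```

The full check compares every entry of the Jacobian, while the projected check compares one scalar per (u, v) pair; the second is cheaper but carries less information per comparison, which is the trade-off under discussion here.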

@ngimel ngimel force-pushed the ngimel/renorm_fix branch from 2c2f77b to 0192aa2 Compare June 8, 2021 17:06
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

  [maxnorm_v, eps_v, one_v](vec_t norm) -> vec_t {
    auto fct = maxnorm_v / (norm + eps_v);
-   return vec_t::blendv(fct, one_v, norm > maxnorm_v);
+   return vec_t::blendv(one_v, fct, norm > maxnorm_v);
Contributor

how does this fix the issue?

Collaborator Author

This is a separate issue where the CPU path produced completely wrong results.
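For readers unfamiliar with the argument order, here is a scalar emulation in plain Python of what the swapped `blendv` call changes. It assumes `blendv(a, b, mask)` selects `b` where the mask is set and `a` elsewhere; `renorm_factors`, the list-based `blendv`, and the `eps` value are hypothetical stand-ins rather than the vectorized kernel itself.

```python
def blendv(a, b, mask):
    """Element-wise select: b[i] where mask[i] is true, otherwise a[i]."""
    return [bi if m else ai for ai, bi, m in zip(a, b, mask)]


def renorm_factors(norms, maxnorm, eps=1e-7, fixed=True):
    """Per-row scale factors; eps is an assumed value for illustration."""
    fct = [maxnorm / (n + eps) for n in norms]
    one = [1.0] * len(norms)
    mask = [n > maxnorm for n in norms]
    if fixed:
        # Argument order after the change: scale only rows above maxnorm.
        return blendv(one, fct, mask)
    # Argument order before the change: rows above maxnorm are left alone
    # and rows within the limit get rescaled instead.
    return blendv(fct, one, mask)


norms = [0.5, 2.0]
fixed = renorm_factors(norms, maxnorm=1.0)               # about [1.0, 0.5]
buggy = renorm_factors(norms, maxnorm=1.0, fixed=False)  # about [2.0, 1.0]
```

With the original order, a row whose norm is already within the limit is scaled up and a row that exceeds it is left untouched, which is the opposite of the intended clamping.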

@facebook-github-bot
Contributor

@ngimel merged this pull request in 9d533ef.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Fixes pytorch#59584
albanD, soulitzer, `renorm` grad was completely busted. Fast gradcheck is definitely not doing its job.

Pull Request resolved: pytorch#59615

Reviewed By: jbschlosser

Differential Revision: D28964271

Pulled By: ngimel

fbshipit-source-id: b6878cd24db9189b64b67eb58bd2cd8956cda78a
@github-actions github-actions bot deleted the ngimel/renorm_fix branch February 12, 2024 21:13

Successfully merging this pull request may close these issues.

renorm is failing for slow gradcheck
