Expand the coverage of test_addmm and test_addmm_sizes #43831


Closed
zasdfgbnm wants to merge 11 commits into master from test_addmm

Conversation

@zasdfgbnm (Collaborator) commented Aug 29, 2020

  • This test is very fast and very important, so it makes no sense to mark it as a slowTest
  • This test should also run on CUDA
  • This test should check alpha and beta support
  • This test should check out= support
  • The manual reference computation should build a Python list instead of using index_put, because the list approach is much faster (a sketch of this is shown below)
  • The precision for TF32 still needs to be fixed; that will be done in a future PR
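
A minimal, self-contained sketch (not the PR's exact code) of the list-based reference computation, including the alpha/beta and out= checks; the helper name addmm_reference and the shapes/tolerances here are illustrative:

    import torch

    def addmm_reference(M, m1, m2, alpha=1, beta=1):
        # Hypothetical helper: accumulate each output element in Python floats
        # and build a nested list, converting to a tensor once at the end.
        # This avoids the per-element index_put calls the old test used.
        n, k = m1.shape
        p = m2.shape[1]
        rows = []
        for i in range(n):
            row = []
            for j in range(p):
                acc = sum(m1[i, l].item() * m2[l, j].item() for l in range(k))
                row.append(beta * M[i, j].item() + alpha * acc)
            rows.append(row)
        return torch.tensor(rows, dtype=M.dtype, device=M.device)

    # Compare torch.addmm against the reference, including the out= variant.
    M = torch.randn(3, 4)
    m1, m2 = torch.randn(3, 5), torch.randn(5, 4)
    expected = addmm_reference(M, m1, m2, alpha=1.2, beta=0.8)
    res = torch.addmm(M, m1, m2, alpha=1.2, beta=0.8)
    out = torch.empty_like(res)
    torch.addmm(M, m1, m2, alpha=1.2, beta=0.8, out=out)
    assert torch.allclose(res, expected, atol=1e-4, rtol=0)
    assert torch.allclose(out, expected, atol=1e-4, rtol=0)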

@zasdfgbnm zasdfgbnm requested review from ngimel and mruberry August 29, 2020 08:32
Inline review thread on the diff:

    self.assertEqual(res1, res2, atol=prec, rtol=0)
    }[dtype]

    if False and dtype.is_complex:  # bug to be fixed in another PR
@zasdfgbnm (Collaborator, Author)

To be fixed in #43827


codecov bot commented Aug 29, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@4e4626a).
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master   #43831   +/-   ##
=========================================
  Coverage          ?   40.16%           
=========================================
  Files             ?      378           
  Lines             ?    46728           
  Branches          ?        0           
=========================================
  Hits              ?    18767           
  Misses            ?    27961           
  Partials          ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e4626a...511266f.


dr-ci bot commented Aug 29, 2020

💊 CI failures summary and remediations

As of commit c9b002c (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@zasdfgbnm zasdfgbnm changed the title from "Expand the coverage of test_addmm" to "Expand the coverage of test_addmm and test_addmm_sizes" on Aug 29, 2020
Inline review thread on the diff:

    @dtypesIfCUDA(*torch.testing.get_all_complex_dtypes(), *torch.testing.get_all_fp_dtypes(include_bfloat16=False))
    @dtypes(*torch.testing.get_all_complex_dtypes(), *torch.testing.get_all_fp_dtypes())
    def test_addmm(self, device, dtype):
        prec = {
@ngimel (Collaborator) commented Aug 31, 2020

can you use @precisionOverride for this?

@zasdfgbnm (Collaborator, Author) commented Aug 31, 2020

Fixed. Note that the precision for bfloat16 is bumped to 0.6, because we are now accumulating in float32 instead of in bfloat16 scalars.

AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.016 and atol=0.1, found 2 element(s) (out of 250) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.53125 (8.375 vs. 7.84375), which occurred at index (6, 4).

@ngimel (Collaborator) left a comment

Thank you! Waiting for CI.
Not for this PR, but we should also test that with beta=0, NaNs and Infs in M are not propagated; right now I think we are only doing this for empty inputs.
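
Not part of this PR, but a minimal sketch of the follow-up test being suggested here (shapes are illustrative): with beta=0 the input matrix M is ignored, so NaN/Inf values placed in it must not leak into the result.

    import torch

    # Fill M with NaN (plus one Inf); with beta=0 these must be ignored rather
    # than multiplied through, so the result should equal m1 @ m2 and contain
    # no non-finite values.
    M = torch.full((3, 4), float('nan'))
    M[0, 0] = float('inf')
    m1, m2 = torch.randn(3, 5), torch.randn(5, 4)
    res = torch.addmm(M, m1, m2, beta=0)
    assert torch.isfinite(res).all()
    assert torch.allclose(res, m1 @ m2)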

@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator) commented Sep 1, 2020

@ailzhang do you know why xla does not inherit the precisionOverride here for bfloat16 (0.6)? Is it ok if we disable bfloat16 on xla?

Inline review thread on the diff:

    @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6,
                        torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8})
    @dtypesIfCUDA(*torch.testing.get_all_complex_dtypes(), *torch.testing.get_all_fp_dtypes(include_bfloat16=False))
    @dtypes(*torch.testing.get_all_complex_dtypes(), *torch.testing.get_all_fp_dtypes())
Collaborator

can you please not include bfloat16 here, and include it only in @dtypesIfCPU so that bfloat16 is not run on XLA?
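
A sketch of the decorator arrangement being requested, assuming the test framework's @dtypesIfCPU decorator (the class scaffolding below is illustrative, not this PR's code): bfloat16 stays in the CPU-specific and precisionOverride lists, but is dropped from the generic @dtypes list that XLA picks up.

    import torch
    from torch.testing._internal.common_device_type import (
        dtypes, dtypesIfCPU, dtypesIfCUDA, precisionOverride,
        instantiate_device_type_tests)
    from torch.testing._internal.common_utils import TestCase, run_tests

    class TestAddmmExample(TestCase):
        @precisionOverride({torch.double: 1e-8, torch.float: 1e-4, torch.bfloat16: 0.6,
                            torch.half: 1e-1, torch.cfloat: 1e-4, torch.cdouble: 1e-8})
        @dtypesIfCUDA(*torch.testing.get_all_complex_dtypes(),
                      *torch.testing.get_all_fp_dtypes(include_bfloat16=False))
        # bfloat16 runs only through the CPU-specific list...
        @dtypesIfCPU(*torch.testing.get_all_complex_dtypes(),
                     *torch.testing.get_all_fp_dtypes(include_bfloat16=True))
        # ...while the generic list, which XLA inherits, omits it.
        @dtypes(*torch.testing.get_all_complex_dtypes(),
                *torch.testing.get_all_fp_dtypes(include_bfloat16=False))
        def test_addmm(self, device, dtype):
            ...

    instantiate_device_type_tests(TestAddmmExample, globals())

    if __name__ == '__main__':
        run_tests()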

@ailzhang (Contributor) commented Sep 1, 2020

cc @JackCaoG, who made some changes to the bfloat16 test skipping. Do you have an idea what the best workaround here is?

@JackCaoG (Collaborator) commented Sep 1, 2020

@ailzhang the change was to auto-skip all of the float16 tests; we still want to run bfloat16 tests. My guess is that pt/xla has its own precision override for each type and ignores pytorch's precision.

@ngimel (Collaborator) commented Sep 1, 2020

@JackCaoG @ailzhang so what do you guys suggest we do? Currently bfloat16 test is failing with

Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 182 element(s) (out of 250) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.53125 (8.375 vs. 7.84375), which occurred at index (6, 4).

tbh, rtol 0.001 and atol 0.001 seem incredibly low for bfloat16 even under the best of circumstances; typically errors are much larger, and here the test definitely won't pass with such tolerances. Is there a way to override tolerances on the xla side?

@JackCaoG (Collaborator) commented Sep 1, 2020

(quoting @ngimel's comment above)

@ngimel Let me submit a PR on the pt/xla side to disable this test on our end. Ideally we should take the pytorch precision override if it is bigger than pt/xla's; I will investigate this a bit too.

@zasdfgbnm (Collaborator, Author)

Looks like GitHub automatically closed this by mistake? I will reopen and rebase.

@zasdfgbnm zasdfgbnm reopened this Sep 2, 2020
@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@ngimel merged this pull request in bc45c47.

@zasdfgbnm zasdfgbnm deleted the test_addmm branch September 3, 2020 16:24
7 participants