
Conversation

walterddr
Contributor

No description provided.

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 7, 2021

💊 CI failures summary and remediations

As of commit 6df490f (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@walterddr walterddr force-pushed the hackathon_opinfo_matmul branch from 9312e5f to d03d0b6 on April 7, 2021 23:09
@codecov

codecov bot commented Apr 8, 2021

Codecov Report

Merging #55543 (6df490f) into master (f1a0b81) will decrease coverage by 0.19%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #55543      +/-   ##
==========================================
- Coverage   77.43%   77.24%   -0.20%     
==========================================
  Files        1892     1892              
  Lines      187597   187605       +8     
==========================================
- Hits       145268   144908     -360     
- Misses      42329    42697     +368     

@mruberry mruberry mentioned this pull request Apr 8, 2021
Comment on lines 3224 to 3280
Contributor Author

@mruberry which of these skips is worth filing an issue for?

I think test_unsupported_dtypes definitely deserves one, since neither the supported nor the unsupported dtype test passes for float16.
Not sure about the other two, though.

@walterddr walterddr force-pushed the hackathon_opinfo_matmul branch from d03d0b6 to c302658 on April 8, 2021 23:50
((S, S, M, M), (M,), "4d_1d", (True,)),
((M,), (S, S, M, S), "1d_4d", (True,))]
sample_inputs = []
for input_args in test_cases:
Collaborator

for lhs_shape, rhs_shape in test_cases:



def sample_inputs_matmul(op_info, device, dtype, requires_grad):
test_cases = [((L,), (L,), '', (True,)),
Collaborator

The strings and bools can be removed from this iterable.

Style nit: prefer tuples to lists when the iterable doesn't need to be mutable.

((M,), (S, S, M, S), "1d_4d", (True,))]
sample_inputs = []
for input_args in test_cases:
args = (make_tensor(input_args[0], device, dtype, low=None, high=None, requires_grad=requires_grad),
Collaborator

lhs = make_tensor(lhs_shape, device, dtype, requires_grad=requires_grad)
rhs = make_tensor(rhs_shape, device, dtype, requires_grad=requires_grad)
sample_inputs.append(SampleInput(lhs, args=(rhs,)))
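For reference, a minimal sketch of the sampler with these review suggestions applied (shape-pair-only test cases, a tuple instead of a list, and direct unpacking into lhs_shape/rhs_shape). The shapes listed are only the ones visible in the excerpts above, and make_tensor, SampleInput, and the L, M, S size constants are assumed to be the helpers already defined in common_methods_invocations.py:

# Sketch only: the real test_cases tuple in the PR contains more shape pairs.
def sample_inputs_matmul(op_info, device, dtype, requires_grad):
    test_cases = (((L,), (L,)),
                  ((S, S, M, M), (M,)),
                  ((M,), (S, S, M, S)))
    sample_inputs = []
    for lhs_shape, rhs_shape in test_cases:
        lhs = make_tensor(lhs_shape, device, dtype, requires_grad=requires_grad)
        rhs = make_tensor(rhs_shape, device, dtype, requires_grad=requires_grad)
        sample_inputs.append(SampleInput(lhs, args=(rhs,)))
    return sample_inputs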

sample_inputs_func=sample_inputs_matmul,
skips=(
# matmul does not correctly warn when resizing out= inputs
SkipInfo('TestCommon', 'test_out'),
Collaborator

This is correct and doesn't require an issue (the comments alone are enough to track out= behavior, and many ops do not implement their out= behavior properly)

skips=(
# matmul does not correctly warn when resizing out= inputs
SkipInfo('TestCommon', 'test_out'),
# matmul on complex128 doesn't work
Collaborator

Can you elaborate on what's going on here? This may require an issue.

# matmul on complex128 doesn't work
SkipInfo('TestGradients', 'test_fn_grad',
device_type='cpu', dtypes=(torch.complex128,)),
# somehow it doesn't throw on CPU, and crashes when enabled
Collaborator

Yikes! Crashing sounds like an issue. Maybe even a high priority issue!

Collaborator

@mruberry mruberry left a comment

This is looking good, @walterddr. I made a few suggestions for code simplification and for two issues to file. With those tweaks and links to both issues in comments this should be OK to land!

@mruberry mruberry self-requested a review April 12, 2021 05:25
Collaborator

@mruberry mruberry left a comment

Cool! Thanks @walterddr!

@facebook-github-bot
Contributor

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@walterddr merged this pull request in 08561ca.

@facebook-github-bot
Contributor

This pull request has been reverted by dab1cdf.

@mruberry
Collaborator

Looks like this hit another build with no PR CI signal :(

Should be a straightforward fix. If you submit it using a ci-all/ branch, that will let us validate that the issue is resolved.

@anjali411
Contributor

Hey @walterddr, this PR fails on CUDA 11. These tests are currently not run by CI by default. I think you need to add:

dtypesIfCUDA=floating_types_and(torch.float16, torch.complex64, torch.complex128,
                                *[torch.bfloat16] if (CUDA11OrLater and SM53orLater) else []),

to your OpInfo. Also, could you please submit the PR to be relanded from a ci-all branch?
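For context, a rough sketch of where that line could slot into the matmul OpInfo entry. Apart from the dtypesIfCUDA line above and the fields visible in the diff excerpts earlier in this thread, the surrounding details are assumptions, not the exact entry:

# Illustrative placement only; the surrounding fields are assumptions.
OpInfo('matmul',
       dtypes=floating_types_and(torch.float16, torch.complex64, torch.complex128),
       dtypesIfCUDA=floating_types_and(torch.float16, torch.complex64, torch.complex128,
                                       *[torch.bfloat16] if (CUDA11OrLater and SM53orLater) else []),
       sample_inputs_func=sample_inputs_matmul,
       skips=(
           # matmul does not correctly warn when resizing out= inputs
           SkipInfo('TestCommon', 'test_out'),
       )),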

@imaginary-person
Contributor

imaginary-person commented Apr 12, 2021

Hello @anjali411 & @walterddr,

It seems like dot is not implemented for BFloat16 at all, so the sample input for matmul that uses dot would cause the test to fail even with the check mentioned in the above comment.

return AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND(at::ScalarType::Half, self.scalar_type(), "dot", [&] {

The GPUs used are SM75, so the check you mentioned will enable the failing test anyway, i.e. *[torch.bfloat16] if (CUDA11OrLater and SM53orLater) else [] would be torch.bfloat16.

What about not using that sample input if the dtype is BFloat16?
Thanks!

Ref for GPU capability:

:: default on circleci is Tesla T4 which has capability of 7.5, ref: https://developer.nvidia.com/cuda-gpus
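One hedged sketch of that idea, i.e. dropping the vector-vector sample (which dispatches to dot) when sampling bfloat16 inputs. This is purely illustrative and not necessarily the fix that ultimately landed; names follow the sampler sketch shown earlier in this thread:

# Illustrative only: inside sample_inputs_matmul's loop, skip the 1-D x 1-D case
# (it lowers to dot, which the dispatch macro quoted above excludes for BFloat16).
for lhs_shape, rhs_shape in test_cases:
    if dtype == torch.bfloat16 and len(lhs_shape) == 1 and len(rhs_shape) == 1:
        continue
    lhs = make_tensor(lhs_shape, device, dtype, requires_grad=requires_grad)
    rhs = make_tensor(rhs_shape, device, dtype, requires_grad=requires_grad)
    sample_inputs.append(SampleInput(lhs, args=(rhs,)))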


@mruberry
Collaborator

> @mruberry, @anjali411 & @heitorschueroff: for such PRs, how about running the tests that don't run by default (before importing them, during development) by temporarily modifying pytorch/.circleci/config.yml, and then reverting that file to its original state after those CI checks complete?

Submitting from a branch that starts with "ci-all/" should run all the checks without having to make those changes, though.

@mruberry
Collaborator

> Hello @anjali411 & @walterddr,
>
> It seems like dot is not implemented for BFloat16 at all, so the sample input for matmul that uses dot would cause the test to fail even with the check mentioned in the above comment.
>
> return AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND(at::ScalarType::Half, self.scalar_type(), "dot", [&] {
>
> The GPUs used are SM75, so the check you mentioned will enable the failing test anyway, i.e. *[torch.bfloat16] if (CUDA11OrLater and SM53orLater) else [] would be torch.bfloat16.
>
> What about not using that sample input if the dtype is BFloat16?
> Thanks!

Thanks for this analysis, @imaginary-person. Seems like removing bfloat16 from the list of supported matmul dtypes (on CUDA) is a good idea. What do you think, @walterddr?
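In OpInfo terms that would amount to something like the following; the non-bfloat16 dtypes listed here are an assumption about the rest of the CUDA dtype list, not a quote of the actual change:

# Sketch: leave bfloat16 out of the CUDA dtype list so no bfloat16 samples are generated.
dtypesIfCUDA=floating_types_and(torch.float16, torch.complex64, torch.complex128),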

@imaginary-person
Contributor

imaginary-person commented Apr 12, 2021

> Submitting from a branch that starts with "ci-all/" should run all the checks without having to make those changes, though.

Thank you, @mruberry! That seems a lot more convenient.

EDIT: I can't use it, though, as only folks at FB have the permission to do so 😞


@mruberry
Collaborator

> Submitting from a branch that starts with "ci-all/" should run all the checks without having to make those changes, though.
>
> Thank you, @mruberry! That seems a lot more convenient.
>
> EDIT: I can't use it, though, as only folks at FB have the permission to do so 😞

I'm sorry; I thought it was available to everyone. @malfet, do we provide a mechanism for the OSS community to use ci-all?

@imaginary-person
Contributor

No worries, @mruberry! I was able to run the tests on CUDA 11.1 by (temporarily) modifying the CircleCI config file.

@imaginary-person
Contributor

imaginary-person commented Apr 13, 2021

@walterddr, I was wrong about the CUDA compute capability being 75 in the failing test.
It's 52 in the logs. I don't know if that's intentional (tests backward compatibility?), or if the GPU isn't actually an Nvidia Tesla T4.

@walterddr
Contributor Author

> @walterddr, I was wrong about the CUDA compute capability being 75 in the failing test.
> It's 52 in the logs. I don't know if that's intentional (tests backward compatibility?), or if the GPU isn't actually an Nvidia Tesla T4.

Thanks for the report. Let me create a follow-up PR from the base repo and tag you.

@imaginary-person
Contributor

> Thanks for the report. Let me create a follow-up PR from the base repo and tag you.

No worries! I only mentioned it because I had misinformed you earlier. It seems only the Windows tests use SM_75 GPUs.
It doesn't change the approach required for this PR: you should remove bfloat16, as @mruberry mentioned.

facebook-github-bot pushed a commit that referenced this pull request Apr 14, 2021
Summary:
This is a reland of #55543 after fixing bfloat16 issues.

Pull Request resolved: #55947

Reviewed By: mruberry

Differential Revision: D27765035

Pulled By: walterddr

fbshipit-source-id: b27a769de7686777012194ebbc1f38fc5d4acb67
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary: Pull Request resolved: pytorch#55543

Reviewed By: samestep

Differential Revision: D27708944

Pulled By: walterddr

fbshipit-source-id: c200ded15082eaeed7ba3077a0c8629fed0db505
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
This is a reland of pytorch#55543 after fixing bfloat16 issues.

Pull Request resolved: pytorch#55947

Reviewed By: mruberry

Differential Revision: D27765035

Pulled By: walterddr

fbshipit-source-id: b27a769de7686777012194ebbc1f38fc5d4acb67