
Conversation

@mcarilli (Collaborator) commented Feb 22, 2021

Should close #51992.

Suggested by one of our compiler people (Hari Sandanagobalane; I don't know his GitHub handle). Also big thanks to @ngimel @zasdfgbnm for distilling a minimal repro of the original failures.

Good news is, the bug is likely a foreach kernel bug and not an 11.2 compiler bug. Therefore, in theory it could affect any CUDA version. We think 11.2 exposed it by optimizing more aggressively than previous toolkits.
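
As an aside on this point, here is a minimal, hypothetical sketch (not the actual foreach kernels) of how this kind of out-of-bounds access can appear to work on one toolkit and break on a more aggressively optimizing one:

```cuda
// Hypothetical sketch, not PyTorch code: buf has four elements, but some
// call sites end up passing idx == 4. The read is undefined behavior under
// every toolkit; an older compiler may happen to return a harmless value,
// while a more aggressive optimizer is free to assume the access never
// happens and to reorder or drop surrounding code, so the bug only becomes
// visible once the compiler optimizes harder.
__device__ float read_elem(const float (&buf)[4], int idx) {
  return buf[idx];  // out of bounds when idx >= 4
}
```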

IIRC some foreach tests were disabled because of this bug, but I'm not sure which or how many. @ngimel @izdeby what tests should I reenable?

@facebook-github-bot (Contributor) commented Feb 22, 2021

💊 CI failures summary and remediations

As of commit 60c811b (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job: GitHub Actions flake8-py3
Step: Add annotations

This comment was automatically generated by Dr. CI.

@codecov (bot) commented Feb 22, 2021

Codecov Report

Merging #52591 (e53c5d3) into master (d491fc6) will increase coverage by 0.55%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #52591      +/-   ##
==========================================
+ Coverage   80.20%   80.76%   +0.55%     
==========================================
  Files        1969     1969              
  Lines      216041   216063      +22     
==========================================
+ Hits       173284   174502    +1218     
+ Misses      42757    41561    -1196     

@zasdfgbnm (Collaborator)

Please reenable these tests: https://github.com/pytorch/pytorch/pull/51598/files#diff-f2b37d0b5812153acf9ff1e337d375c5cad0076c4ddc30535ec2d55c8ac9b770

@bdhirsh added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 22, 2021
@izdeby mentioned this pull request Feb 22, 2021
@izdeby (Contributor) commented Feb 22, 2021

@mcarilli, thank you for this! Looks like the fix is working.

@ngimel (Collaborator) commented Feb 22, 2021

Please change tests such as `def test_int_scalar(self, device, dtype):` to make sure that this is caught by foreach tests (it definitely was not caught previously; foreach tests were passing on 11.2).

```diff
 for(int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * kILP) {
-  load_args<depth>(r_args, args, i_start, chunk_size, n);
+  // Regardless if "depth" is 1 (for inplace) or 2 (for out of place), r_args has depth 1
+  load_args<1>(r_args, args, i_start, chunk_size, n);
```
Review comment (Collaborator):
The `depth` template argument is no longer needed here.
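
To make the change above concrete, here is a simplified, hypothetical sketch of the failure mode; the names are loosely modeled on ForeachFunctors.cuh, but this is not the actual implementation:

```cuda
// Simplified, hypothetical sketch of the failure mode; not the real code.
constexpr int kILP = 4;

// Stages `depth` operands into the register buffer r_args, one row each.
template <int depth, typename T>
__device__ void load_args(T r_args[][kILP], T** args, int i_start,
                          int chunk_size, int n) {
#pragma unroll
  for (int d = 0; d < depth; d++) {       // writes rows 0 .. depth-1
#pragma unroll
    for (int ii = 0; ii < kILP; ii++) {
      int i = i_start + threadIdx.x + ii * blockDim.x;
      r_args[d][ii] = (i < n && i < chunk_size) ? args[d][i] : T(0);
    }
  }
}

// The functor only needs to stage the input operand, so its buffer has a
// single row:
//   T r_args[1][kILP];
// Instantiating load_args<depth> with depth == 2 (an out-of-place op, where
// args also holds the output pointer) writes r_args[1][...], one row past
// the end of the buffer: undefined behavior that CUDA 11.2's optimizer made
// visible. load_args<1> is always correct here, because only args[0] has to
// be read regardless of depth.
```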

@mcarilli (Collaborator, Author)

Closing to resubmit to ci-all (#52634).

@mcarilli closed this Feb 22, 2021
facebook-github-bot pushed a commit that referenced this pull request Mar 1, 2021
…n 11.2 (ci-all edition) (#52634)

Summary:
Should close #51992.

ci-all resubmit of #52591. The plot also thickened considerably since then. Every foreach functor, it turns out, has bad `r_args` accesses for certain code paths and instantiations.

Also, I noticed the [`n % kILP == 0`](https://github.com/pytorch/pytorch/blob/2680ff7759d8a441eada383ba7aa0fa42c7d35ed/aten/src/ATen/native/cuda/ForeachFunctors.cuh#L87) condition for vectorization in all functors is way too restrictive: it'll refuse to vectorize anything on any tensor whose overall numel is not a multiple of ILP. That's out of scope though.

Pull Request resolved: #52634

Reviewed By: H-Huang

Differential Revision: D26725991

Pulled By: izdeby

fbshipit-source-id: 4bade0ac186bf85527baddc1c44b2c2b8e3c9777
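
Regarding the vectorization observation in the commit summary above, here is a hedged sketch (hypothetical helper names, not the actual multi_tensor_apply / foreach kernel code) contrasting the numel-based gate with a per-chunk gate that would only send the final partial chunk down the scalar path:

```cuda
// Hedged sketch with hypothetical helper names; not the actual kernel code.
constexpr int kILP = 4;

// The gate described above: the whole tensor's numel (n) must be a multiple
// of kILP, so a single trailing element disables the vectorized path for
// every chunk of that tensor.
__device__ bool can_vectorize_whole_tensor(int n, int chunk_size, bool aligned) {
  return aligned && n % kILP == 0 && chunk_size % kILP == 0;
}

// A less restrictive alternative (sketch only): vectorize any full chunk and
// fall back to the scalar tail loop just for the final, partial chunk.
__device__ bool can_vectorize_this_chunk(int n, int chunk_idx, int chunk_size,
                                         bool aligned) {
  int remaining = n - chunk_idx * chunk_size;  // elements left for this chunk
  return aligned && chunk_size % kILP == 0 && remaining >= chunk_size;
}
```

Whether the extra tail handling pays off is exactly the follow-up that the summary defers as out of scope.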
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
…n 11.2 (ci-all edition) (pytorch#52634)

xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
…n 11.2 (ci-all edition) (pytorch#52634)


Labels

cla signed, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test_optim.py failures for CUDA 11.2 on CI

7 participants