
Add 64bit indexing support for softmax #52713

Closed
wants to merge 10 commits into from

Conversation

@zasdfgbnm (Collaborator) commented Feb 24, 2021

Fixes #52715 and #52716.

The computation is split across the batch dimension.
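
For intuition, here is a minimal Python-level sketch of the batch-splitting idea. The actual change lives in the CUDA dispatch in aten/src/ATen/native/cuda/SoftMax.cu; the function name, the chunk-size logic, and the assumption that each chunk must stay under 2**31 elements are illustrative, not taken from this PR.

import torch

INT32_MAX = 2**31 - 1

def softmax_batch_chunked(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Illustrative only: split the batch (first) dimension into chunks small
    # enough that each softmax launch sees fewer than 2**31 elements.
    if x.dim() < 2 or x.numel() <= INT32_MAX:
        return torch.softmax(x, dim)
    assert dim % x.dim() != 0, "splitting assumes softmax is not over the batch dim"
    inner = x.numel() // x.shape[0]              # elements per batch row
    rows_per_chunk = max(1, INT32_MAX // inner)  # rows that fit the 32-bit range
    pieces = [torch.softmax(c, dim) for c in x.split(rows_per_chunk, dim=0)]
    return torch.cat(pieces, dim=0)

In the PR itself the split happens inside the kernel launch logic rather than in Python, so callers keep using torch.softmax as usual.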

@facebook-github-bot (Contributor) commented Feb 24, 2021

💊 CI failures summary and remediations

As of commit 48f0eba (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Feb 24 19:00:20 sccache: error: couldn't connect to server
Feb 24 19:00:20 +++ eval 'extract_trap_cmd '
Feb 24 19:00:20 ++++ extract_trap_cmd
Feb 24 19:00:20 ++++ printf '%s\n' ''
Feb 24 19:00:20 +++ printf '%s\n' cleanup
Feb 24 19:00:20 ++ trap -- '
Feb 24 19:00:20 cleanup' EXIT
Feb 24 19:00:20 ++ [[ pytorch-xla-linux-bionic-py3.6-clang9-test != *pytorch-win-* ]]
Feb 24 19:00:20 ++ which sccache
Feb 24 19:00:20 ++ sccache --stop-server
Feb 24 19:00:20 Stopping sccache server...
Feb 24 19:00:20 sccache: error: couldn't connect to server
Feb 24 19:00:20 sccache: caused by: Connection refused (os error 111)
Feb 24 19:00:20 ++ true
Feb 24 19:00:20 ++ rm /var/lib/jenkins/sccache_error.log
Feb 24 19:00:20 ++ [[ pytorch-xla-linux-bionic-py3.6-clang9-test == *rocm* ]]
Feb 24 19:00:20 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Feb 24 19:00:20 ++ SCCACHE_IDLE_TIMEOUT=1200
Feb 24 19:00:20 ++ RUST_LOG=sccache::server=error
Feb 24 19:00:20 ++ sccache --start-server
Feb 24 19:00:20 sccache: Starting the server...
Feb 24 19:00:20 ++ sccache --zero-stats

1 job timed out:

  • pytorch_xla_linux_bionic_py3_6_clang9_test

This comment was automatically generated by Dr. CI.

@zasdfgbnm (Collaborator, Author)

cc: @ptrblck

@zasdfgbnm added the "module: cuda" label (Related to torch.cuda, and CUDA support in general) on Feb 24, 2021
@zasdfgbnm linked an issue on Feb 24, 2021 that may be closed by this pull request

@ngimel (Collaborator) left a comment

Can you please add tests?

aten/src/ATen/native/cuda/SoftMax.cu (outdated; review comment resolved)
@zasdfgbnm (Collaborator, Author)

@ngimel I have fixed the bug you caught and added a test. The test passes on my 3090.

@@ -11975,6 +11975,25 @@ def test_softmax_results(self, device, dtype):
self.assertEqual(grad_input, ref_grad_input)
self.assertEqual(input.grad, ref_input.grad)

@onlyCUDA
@dtypesIfCUDA(torch.float, torch.half)
@largeTensorTest("20GB")
@zasdfgbnm (Collaborator, Author):

On my 3090, the half test takes ~18 GB of memory and the float test takes ~19.8 GB.
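
The diff hunk above shows only the decorators. Purely for illustration, a large-tensor test could look roughly like the sketch below; the test name, tensor sizes, and tolerance are assumptions, not the body actually added by this PR.

@onlyCUDA
@dtypesIfCUDA(torch.float, torch.half)
@largeTensorTest("20GB")
def test_softmax_64bit_indexing(self, device, dtype):
    # Hypothetical sketch: more than 2**31 total elements forces the
    # batch-split / 64-bit indexing path this PR adds.
    batch, features = 1100000000, 2   # ~2.2e9 elements total
    x = torch.randn(batch, features, device=device, dtype=dtype, requires_grad=True)
    y = torch.softmax(x, dim=-1)
    y.backward(y)                     # also exercise the gradient kernels
    # Spot-check one row of the huge launch against a small reference launch.
    ref = torch.softmax(x[0].detach().double(), dim=-1)
    self.assertEqual(y[0].double(), ref, atol=1e-3, rtol=0)

With the sizes assumed here, each half-precision buffer (input, output, gradient) is about 4.4 GB, which is why a memory budget on the order of the largeTensorTest value above is needed.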

Collaborator:

Will these tests run in your CI?

@zasdfgbnm (Collaborator, Author):

Our CI has A100 and 3090, so yes!

@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet added this to the 1.8.1 milestone on Feb 24, 2021
@facebook-github-bot (Contributor)

@ngimel merged this pull request in a6b7da7.

@zasdfgbnm deleted the ima-softmax branch on February 25, 2021 at 14:04
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
fixes pytorch#52715 pytorch#52716

split across batch dimension

Pull Request resolved: pytorch#52713

Reviewed By: ailzhang

Differential Revision: D26640033

Pulled By: ngimel

fbshipit-source-id: f169cb0d6abc1cfbddf658d9775759a7d56f5c12
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Labels: cla signed, Merged, module: cuda (Related to torch.cuda, and CUDA support in general), open source
Development

Successfully merging this pull request may close these issues.

  • CUDA error: invalid configuration argument for softmax
  • CUDA Illegal memory access for softmax
5 participants