
extend nonzero to int64 #125850

Open · wants to merge 16 commits into main
Conversation

@bhack (Contributor) commented May 9, 2024

Fixes #51871

@pytorch-bot bot added the release notes: cuda label (release notes category) May 9, 2024
pytorch-bot bot commented May 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125850

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Unrelated Failures

As of commit a46722a with merge base 8f30f36:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@bhack (Contributor, Author) commented May 9, 2024

/cc @ezyang @eqy This is an exploratory black-box PR, as I don't have free CUDA resources right now and we don't have a quick way to set up the env to contribute sparse C++/CUDA PRs (see #125297).

But I made this editable on your side in case you have the env ready and a quick fix is enough.

@bhack bhack marked this pull request as ready for review May 9, 2024 16:52
@bhack bhack requested a review from eqy as a code owner May 9, 2024 16:52
linux-foundation-easycla bot commented May 10, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@eqy (Collaborator) left a comment

Should we add a (presumably large tensor) test for this?

@bhack (Contributor, Author) commented May 10, 2024

Should we add a (presumably large tensor) test for this?

Do we have an INT_MAX test already somewhere that we could expand?

@eqy (Collaborator) commented May 10, 2024

Unfortunately these are not really unified at the moment, but this should surface some examples: https://github.com/search?q=repo%3Apytorch%2Fpytorch+64bit+language%3APython+path%3A%2F%5Etest%5C%2F%2F&type=code

@bhack (Contributor, Author) commented May 10, 2024

As we don't have a specific CUDA test, do we want to find a workaround from Python?

Can you suggest one from grep -R torch.nonzero test/?
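For reference, a minimal sketch of what such a large-tensor test could look like, assuming a GPU with enough free memory and loosely following the style of the existing 64-bit tests under test/; the test name and structure are illustrative, not part of this PR:

import unittest

import torch

class TestNonzero64Bit(unittest.TestCase):
    @unittest.skipUnless(torch.cuda.is_available(), "CUDA required")
    def test_nonzero_past_int_max(self):
        # More than INT_MAX (2**31 - 1) elements; a bool tensor of this
        # size needs a bit over 2 GB of GPU memory.
        n = 2**31 + 1
        x = torch.zeros(n, dtype=torch.bool, device="cuda")
        x[-1] = True
        out = torch.nonzero(x)
        self.assertEqual(out.numel(), 1)
        self.assertEqual(out.item(), n - 1)

if __name__ == "__main__":
    unittest.main()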

@bhack (Contributor, Author) commented May 10, 2024

I think I am going to close this, as cub::DispatchSelectIf will probably be slower than the cub::DeviceSelect::Flagged we are currently using.

We probably need to wait for upstream NVIDIA/cccl#1422.

What do you think?
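For context, this is roughly the two-pass cub::DeviceSelect::Flagged call pattern in question; note that num_items is a 32-bit int in the CUB releases shipped so far, which is the root of the INT_MAX limit. The helper name and element/flag types below are illustrative, not PR code:

#include <cub/cub.cuh>

cudaError_t select_flagged(const float* d_in, const char* d_flags,
                           float* d_out, int* d_num_selected,
                           int num_items,  // 32-bit: capped at INT_MAX
                           cudaStream_t stream) {
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call only computes the required temporary storage size.
  cub::DeviceSelect::Flagged(d_temp, temp_bytes, d_in, d_flags, d_out,
                             d_num_selected, num_items, stream);
  cudaMallocAsync(&d_temp, temp_bytes, stream);
  // Second call performs the actual selection.
  cudaError_t err = cub::DeviceSelect::Flagged(
      d_temp, temp_bytes, d_in, d_flags, d_out,
      d_num_selected, num_items, stream);
  cudaFreeAsync(d_temp, stream);
  return err;
}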

@bhack (Contributor, Author) commented May 11, 2024

@ezyang Do you think we can open a new ticket to lower this with Triton where and sum?
https://github.com/pytorch/pytorch/blob/a174c536f8f32b41da9efa647e364196423468a5/torch/_inductor/lowering.py#L2187C20-L2187C35

Edit:
The ticket is at #126003
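To make the where/sum idea concrete, here is a hedged sketch of 1-D nonzero as a mask/cumsum/scatter stream compaction built from such primitives; this illustrates the decomposition only, it is not the Inductor lowering itself, and nonzero_1d is a hypothetical name:

import torch

def nonzero_1d(x: torch.Tensor) -> torch.Tensor:
    mask = x != 0
    # Compacted output position of each selected element.
    pos = torch.cumsum(mask.to(torch.int64), dim=0) - 1
    n = int(mask.sum())  # data-dependent output size (forces a host sync)
    idx = torch.arange(x.numel(), device=x.device)
    # Route unselected lanes to a scratch slot at the end, then drop it.
    target = torch.where(mask, pos, torch.full_like(pos, n))
    out = torch.zeros(n + 1, dtype=torch.int64, device=x.device)
    out.scatter_(0, target, idx)
    return out[:n].unsqueeze(1)  # match torch.nonzero's (n, 1) shape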

using flag_iterator_t = cub::NullType*;
using equality_op_t = cub::NullType;

return cub::DispatchSelectIf<
@bhack (Contributor, Author) commented on the diff May 11, 2024

Does this require cub/cccl 2.4.0?

@ezyang (Contributor) commented May 11, 2024

Yes, we need a big tensor test. @eqy's link is good for examples.

@bhack (Contributor, Author) commented May 11, 2024

Ok thanks, so I am going to close it, as I don't currently have the env or spare GPU compute to write a brand new test and recompile. At least unless we identify another Python test that already indirectly uses nonzero and that we could modify to take a big input.

@bhack (Contributor, Author) commented May 11, 2024

Just to check whether it could compile at least with the current CUB version: do you know what this CI failure is?

/usr/local/cuda/include/cub/agent/agent_select_if.cuh(264): error: function "at::native::<unnamed>::NonZeroOp<T>::operator() [with T=c10::complex<c10::Half>]" cannot be called with the given argument list
            argument types are: (int64_t)
            object type is: at::native::<unnamed>::NonZeroOp<c10::complex<c10::Half>>
                  selection_flags[ITEM] = select_op(items[ITEM]);
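For reference, the functor in aten/src/ATen/native/cuda/Nonzero.cu has roughly the following shape; the error above indicates that the CUB agent is handing the select op an int64_t item (the counting-iterator offset) rather than an element of type T:

template <typename T>
struct NonZeroOp {
  __host__ __device__ __forceinline__ bool operator()(const T& a) const {
    return a != T(0);  // callable with a T, not with an int64_t index
  }
};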

@bhack (Contributor, Author) commented May 13, 2024

I think we need the cub/cub/agent/agent_select_if.cuh changes introduced in NVIDIA/cccl#1379.

So this means that we need to wait for the next CUDA 12.4 update and also make the new path conditional.

@ezyang (Contributor) commented May 14, 2024

This PR seems fine. I agree you may need to preprocessor your way to victory. CI will say.
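A minimal sketch of that kind of guard, using CUB's standard CUB_VERSION macro (MAJOR*100000 + MINOR*100 + PATCH); the threshold and the two branches are assumptions about how the PR could be gated, not its actual code:

#include <cub/version.cuh>

#if CUB_VERSION >= 200400
// CUB >= 2.4.0 (first bundled with CUDA 12.5): take the 64-bit
// num_items path via cub::DispatchSelectIf.
#else
// Older CUB: keep the existing cub::DeviceSelect::Flagged path,
// which caps num_items at INT_MAX.
#endif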

@ezyang added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 14, 2024
@ezyang ezyang self-requested a review May 14, 2024 18:02
@bhack (Contributor, Author) commented May 23, 2024

@ezyang the new [CUDA 12.5](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) delivers CUB 2.4.0, so it could be enough for this workaround.

Labels: open source · release notes: cuda · triaged
Development

Successfully merging this pull request may close these issues.

Tensor.nonzero fails on GPU for tensors containing more than INT_MAX elements
4 participants