
Conversation

@yushangdi
Contributor

  • When a tensor's numel is 0, we set the hash to 0 instead of hashing it, because torch.hash_tensor does not work for 0-numel tensors (see the repro sketch after this list)
  • Add some tests for distributed
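For context, here is a minimal repro of the limitation described above; it is illustrative only and assumes a PyTorch build that exposes `torch.hash_tensor`:

```python
import torch

# Hashing a non-empty tensor works as expected.
print(torch.hash_tensor(torch.ones(3)))

# Per the description above, hash_tensor does not work for 0-numel tensors,
# which is why this PR falls back to a constant zero hash in that case.
try:
    torch.hash_tensor(torch.empty(0))
except Exception as e:
    print("hash_tensor failed on a 0-numel tensor:", e)
```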

@pytorch-bot

pytorch-bot bot commented Nov 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169027

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit bf2047a with merge base 481e5ab:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@yushangdi yushangdi force-pushed the sy_debug_mode_test branch 2 times, most recently from 8b10b27 to a8651a1 on November 25, 2025 01:07
@yushangdi yushangdi requested review from ngimel and pianpwk November 25, 2025 01:07
@yushangdi yushangdi changed the title from "Add debug mode tests" to "[DebugMode] Fix hash for 0 ele tensor; Add more tests" on Nov 25, 2025
@yushangdi yushangdi force-pushed the sy_debug_mode_test branch 2 times, most recently from 541de94 to 094a7a4 on November 25, 2025 01:16
@yushangdi yushangdi added the ciflow/trunk and topic: not user facing labels on Nov 25, 2025
if t.numel() > 0:
    out = torch.hash_tensor(t_clean)
else:
    out = torch.tensor(0)
Collaborator

you probably still want to avoid a sync here: `out = torch.zeros((), device=t_clean.device)`
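A minimal sketch of the branch with that suggestion applied (the `_tensor_hash` helper name is hypothetical, not the actual DebugMode code):

```python
import torch

def _tensor_hash(t: torch.Tensor, t_clean: torch.Tensor) -> torch.Tensor:
    # hash_tensor is only valid for non-empty tensors (see the PR description).
    if t.numel() > 0:
        return torch.hash_tensor(t_clean)
    # Per the review comment above, creating the zero scalar directly on
    # t_clean's device avoids the sync that the torch.tensor(0) fallback
    # (a CPU tensor) could otherwise introduce.
    return torch.zeros((), device=t_clean.device)
```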

@yushangdi yushangdi force-pushed the sy_debug_mode_test branch 2 times, most recently from 0a262b2 to 497d194 on November 26, 2025 00:21
@yushangdi
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #169027, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@yushangdi
Contributor Author

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased sy_debug_mode_test onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout sy_debug_mode_test && git pull --rebase)

bdhirsh added a commit that referenced this pull request Dec 1, 2025
…hashing outputs"

this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so I'll wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 1, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so I'll wait for that PR to land first and rebase.




[ghstack-poisoned]
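A rough, self-contained sketch of the two cases described in that commit message (single-process gloo setup, illustrative only; this is not the actual #168119 change):

```python
import os
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

# Minimal single-process setup so the sketch can run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

buf = torch.ones(4)
x = torch.ones(4)

# (1) A non-functional collective launched with async_op=True returns a Work
#     handle; waiting on that handle is enough before hashing buf.
work = dist.all_reduce(buf, async_op=True)
work.wait()

# (2) Functional collectives return a tensor that can always be passed
#     through wait_tensor before hashing.
out = funcol.all_reduce(x, "sum", group=dist.group.WORLD)
out = funcol.wait_tensor(out)

dist.destroy_process_group()
```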
@yushangdi
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following check: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx942.1)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
- When a tensor's numel is 0, we set the hash to 0 instead of hashing it, because torch.hash_tensor does not work for 0-numel tensors
- Add some tests for distributed
Pull Request resolved: #169027
Approved by: https://github.com/xmfan, https://github.com/ngimel
