[FSDP2][Test] Fix _test_clip_grad_norm #126457

Closed


wz337
Contributor

@wz337 wz337 commented May 16, 2024

Fixes #ISSUE_NUMBER
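We need to compare ref_total_norm to total_norm.full_tensor() rather than to total_norm itself: total_norm is a DTensor with a _NormPartial placement, so its local tensor holds only this rank's partial norm.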
Example:

iter_idx:0, rank:0,\
ref_total_norm=tensor(1052.5934, device='cuda:0'),\
total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),\
total_norm.full_tensor()=tensor(1052.5934, device='cuda:0')
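
To illustrate why the local value (482.0861...) is smaller than the reference (1052.5934) in the log above, here is a minimal pure-Python sketch of how per-rank partial 2-norms combine into a global norm. It assumes the reduction implied by `_NormPartial(reduce_op='sum', norm_type=2.0)` is "raise each partial norm to the power p, sum across ranks, take the p-th root"; the helper names and gradient values are hypothetical and are not taken from the test or from DTensor's code.

```python
import math

def local_norm(shard, p=2.0):
    """p-norm of the values held by a single (simulated) rank."""
    return sum(abs(x) ** p for x in shard) ** (1.0 / p)

def reduce_partial_norms(local_norms, p=2.0):
    """Combine per-rank partial p-norms into one global p-norm.

    Assumed semantics: sum of (partial norm ** p) over ranks, then p-th root.
    """
    return sum(n ** p for n in local_norms) ** (1.0 / p)

# Hypothetical gradient values, split across two simulated ranks.
full_grad = [3.0, 4.0, 12.0, 84.0]
shards = [full_grad[:2], full_grad[2:]]

# Reference: norm computed over the unsharded gradient.
ref_total_norm = local_norm(full_grad)

# Each rank sees only its own shard, so its norm is a partial result.
partial_norms = [local_norm(s) for s in shards]

# Reducing the partial norms recovers the reference value.
total_norm = reduce_partial_norms(partial_norms)

print(ref_total_norm, partial_norms, total_norm)
```

In this sketch, comparing `ref_total_norm` against `partial_norms[0]` would fail, while comparing it against the reduced `total_norm` passes, which mirrors comparing against `total_norm.full_tensor()` in the test.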

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126457

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fd65b3c with merge base a0429c0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and topic: not user facing labels May 16, 2024
@wz337 wz337 force-pushed the fix_test_fully_shard_clip_grad_norm_ branch from a168e60 to 5e2051d May 16, 2024 23:02
@wz337 wz337 force-pushed the fix_test_fully_shard_clip_grad_norm_ branch from 5e2051d to fd65b3c May 16, 2024 23:03
@wz337 wz337 marked this pull request as ready for review May 16, 2024 23:05
@wz337 wz337 requested a review from awgu May 16, 2024 23:05
@wz337 wz337 changed the title from "fix _test_clip_grad_norm" to "[FSDP2][Test] Fix _test_clip_grad_norm" May 16, 2024
Contributor

@awgu awgu left a comment

Thanks!

@wz337
Contributor Author

wz337 commented May 17, 2024

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label May 17, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
Pull Request resolved: pytorch#126457
Approved by: https://github.com/awgu
Labels
ciflow/trunk, Merged, oncall: distributed, topic: not user facing