
Conversation

Isalia20
Collaborator

@Isalia20 Isalia20 commented Mar 29, 2025

Fixes #142397

Basic implementation is done. What's left:

  • Different dtype/device tensors in the TensorList
  • fast path for grouping the foreach kernel
  • Tests

Regarding tests, I found some tests in `test/test_torch.py` for GradScaler, but I couldn't figure out the best way to enable them for the MPS device.

Removing `@onlyNativeDeviceTypes` enables the tests for MPS, but it also enables them for all other devices outside the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

this enables many tests in that class for MPS which were (apparently) not being tested before. This part needs some clarification.
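For context on what the kernel behind GradScaler has to do, here is a minimal pure-Python sketch of the scale-update rule that `torch.amp.GradScaler` applies each step (the function name and signature here are illustrative, not the actual PyTorch API; the real logic lives in the `_amp_update_scale_` kernel this PR ports to MPS):

```python
# Hedged sketch of GradScaler's scale-update rule. Not the PyTorch API:
# the real implementation operates in-place on device tensors.

def amp_update_scale(current_scale: float,
                     growth_tracker: int,
                     found_inf: bool,
                     growth_factor: float = 2.0,
                     backoff_factor: float = 0.5,
                     growth_interval: int = 2000) -> tuple[float, int]:
    """Return the updated (scale, growth_tracker) pair."""
    if found_inf:
        # Overflow detected: back the scale off and restart the streak.
        return current_scale * backoff_factor, 0
    growth_tracker += 1
    if growth_tracker == growth_interval:
        # A full streak of finite steps: grow the scale.
        return current_scale * growth_factor, 0
    return current_scale, growth_tracker

# One overflow halves the scale; a full clean streak doubles it.
scale, tracker = amp_update_scale(65536.0, 0, found_inf=True)
print(scale, tracker)  # 32768.0 0
```

The defaults shown (2.0 / 0.5 / 2000) mirror GradScaler's documented constructor defaults; the point is that the update is a tiny per-step state machine, which is why it is worth a fused device kernel.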

cc @mcarilli @ptrblck @leslie-fang-intel @jgong5 @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen


pytorch-bot bot commented Mar 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150255

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 70 Pending

As of commit 8d72182 with merge base 7ac8186:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) module: amp (automated mixed precision) autocast release notes: mps Release notes category labels Mar 29, 2025
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.


Caused by:

@Isalia20 Isalia20 added the module: mps Related to Apple Metal Performance Shaders framework label Mar 29, 2025
@Isalia20 Isalia20 marked this pull request as draft March 29, 2025 12:20
@pytorch-bot pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) and removed ciflow/mps Run MPS tests (subset of trunk) labels Mar 29, 2025
@Isalia20 Isalia20 added the ciflow/mps Run MPS tests (subset of trunk) label Mar 29, 2025
@Isalia20 Isalia20 marked this pull request as ready for review March 29, 2025 22:22
@Isalia20
Collaborator Author

Something seems to be wrong with the MPS tests; they are green when I build locally 🤔
[Screenshot 2025-03-30 at 02:20:45]

@Isalia20 Isalia20 added the ciflow/mps Run MPS tests (subset of trunk) label Mar 30, 2025
@Isalia20
Collaborator Author

Okay, that fixed the errors; ready for review now.

Comment on lines 15 to 17
#define lib _ignored_lib_name_for_fused
#include <ATen/native/mps/FusedOptimizerOps_metallib.h>
#undef lib
Collaborator Author

Not sure if there is a better way to do this 🤔

Contributor

Please don't move library instantiation to the header (and never import two libraries at once)

@malfet malfet added ciflow/trunk Trigger trunk jobs on your pull request ciflow/mps Run MPS tests (subset of trunk) labels Apr 6, 2025
@malfet
Contributor

malfet commented Apr 6, 2025

A few meta points about this PR:

  • Pay attention to the difference between signed and unsigned types. For example, looking at

    ```cuda
    __global__ void amp_update_scale_cuda_kernel(float* current_scale,
                                                 int* growth_tracker,
                                                 const float* found_inf,
                                                 double growth_factor,
                                                 double backoff_factor,
                                                 int growth_interval)
    ```

    both `growth_tracker` and `growth_interval` are signed types, but in your PR `growth_interval` turned into an unsigned one. That might be fine semantically, but in that case it would be good to add a check that the value is positive before casting it to the unsigned type.
  • I'm not sure AMP testing is fully contained in the ciflow/mps workflow, so getting a signal from trunk would be great.
  • When you see too many known failures in the pytorch-bot/Dr. CI comments, rebase onto stable; otherwise the signal can be occluded.
  • Pay attention to error checking and do it as early as possible during kernel execution.
  • If you opt to allocate GPU memory manually, make sure you free it after use.
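To make the signed/unsigned point concrete, here is a small hedged illustration (the helper names are invented for this sketch): reinterpreting a negative 32-bit integer's bit pattern as unsigned, as an implicit C/Metal cast would, silently produces a huge value, which is why validating positivity before the cast matters.

```python
# Illustrative only: simulates a 32-bit signed-to-unsigned reinterpretation
# in Python, where ints are arbitrary precision, by masking to 32 bits.

def to_uint32(value: int) -> int:
    """Reinterpret a signed 32-bit integer's bit pattern as unsigned."""
    return value & 0xFFFFFFFF

def checked_uint32(value: int) -> int:
    """Validate before casting, as the review comment suggests."""
    if value <= 0:
        raise ValueError(f"growth_interval must be positive, got {value}")
    return to_uint32(value)

print(to_uint32(-1))         # 4294967295: silent wraparound
print(checked_uint32(2000))  # 2000: safe after the check
```

A `growth_interval` of -1 smuggled through an unchecked cast would make the grow-the-scale branch effectively unreachable for ~4 billion steps, so the early check fails loudly instead.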

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/mps Run MPS tests (subset of trunk) labels Apr 6, 2025
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 6, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Apr 6, 2025
@malfet
Contributor

malfet commented Apr 6, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 6, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet
Contributor

malfet commented Apr 6, 2025

@pytorchbot merge -f "Lint + MPS are green, hopefully trunk as well"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; consider -i/--ignore-current instead to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@kurzdev

kurzdev commented Apr 6, 2025

Awesome, tysm! ❤️

timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
Fixes pytorch#142397

Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests

Regarding tests, I found some tests in `test/test_torch.py` for GradScaler but I couldn't figure out what is the best way to enable the test for MPS device.

By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification

Pull Request resolved: pytorch#150255
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: amp (automated mixed precision) autocast module: mps Related to Apple Metal Performance Shaders framework open source release notes: mps Release notes category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable GradScaler for MPS devices
6 participants