Conversation

razarmehr
Collaborator

  • This change introduces these APIs to enable developing custom kernels on the MPS stream (see the usage sketch below):
    `torch::mps::get_command_buffer()`
    `torch::mps::get_dispatch_queue()`
    `torch::mps::commit()`
  • Add an ObjC test case
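
A minimal Objective-C++ sketch of how these three calls might fit together in a user's custom kernel. The wrapper function, pipeline-state argument, and dispatch sizes are illustrative assumptions, not part of this PR; binding the tensor's underlying MTLBuffer to the encoder is omitted, and the exact headers/return typedefs may differ:

```
#include <torch/torch.h>  // assumes the torch::mps APIs are reachable from here
#import <Metal/Metal.h>

// Encode a custom compute kernel onto the command buffer of the current
// MPS stream, then commit so PyTorch can continue on a fresh buffer.
// `pso` is a previously compiled MTLComputePipelineState (hypothetical input).
void run_custom_kernel(id<MTLComputePipelineState> pso, const torch::Tensor& input) {
  const auto numel = static_cast<NSUInteger>(input.numel());
  id<MTLCommandBuffer> commandBuffer = torch::mps::get_command_buffer();
  dispatch_queue_t serialQueue = torch::mps::get_dispatch_queue();

  // Encoding on the stream's serial queue keeps this kernel ordered with
  // respect to the kernels PyTorch itself encodes on the same stream.
  dispatch_sync(serialQueue, ^{
    id<MTLComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoder];
    [encoder setComputePipelineState:pso];
    // setBuffer: calls binding the tensor's storage would go here.
    [encoder dispatchThreads:MTLSizeMake(numel, 1, 1)
       threadsPerThreadgroup:MTLSizeMake(pso.maxTotalThreadsPerThreadgroup, 1, 1)];
    [encoder endEncoding];
    torch::mps::commit();
  });
}
```

In an Objective-C translation unit the returned handles should be usable directly as shown; from plain C++ they would presumably be opaque pointers.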

@pytorch-bot

pytorch-bot bot commented May 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100661

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f12a876:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/mps and release notes: mps labels May 4, 2023
@razarmehr razarmehr added the ciflow/trunk label May 4, 2023
@kulinseth kulinseth requested a review from albanD May 4, 2023 21:42
@razarmehr
Collaborator Author

The CI trunk failures seem unrelated to the PR:
bash: ./.ci/pytorch/python_doc_push_script.sh: No such file or directory

Collaborator

@albanD albanD left a comment

By the way, if the answer to a lot of my questions is "there is a single stream, so this is not relevant", it should be made very explicit.

@razarmehr razarmehr force-pushed the MPS_CustomKernel branch from 8f0f34f to 5dcef82 Compare May 5, 2023 20:54
@razarmehr razarmehr requested a review from albanD May 5, 2023 20:58
Collaborator

@kulinseth kulinseth left a comment

Thanks @albanD for the review, and @razarmehr for the PR. The change looks good.

Contributor

@malfet malfet left a comment

Looks good, but please undef the define in the header. (And again, it would be nice to submit this as a separate PR, as it has nothing to do with adding support for custom kernels.)
Also, not sure why `_mps_synchronize` was renamed to `_mps_DeviceSynchronize`.

Collaborator

@albanD albanD left a comment

The change in itself sounds OK, but I think we need to be a lot stricter about documentation if we expect non-MPS maintainers to be able to use this and to review PRs touching MPS code.

@razarmehr
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: .github/workflows/lint.yml / lintrunner / linux-job

@pytorch pytorch deleted a comment from pytorch-bot bot May 8, 2023
@razarmehr
Collaborator Author

@pytorchbot merge -f "all checks except the pre-existing lint-linux failure are green"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@pytorch pytorch deleted a comment from pytorchmergebot May 11, 2023
@pytorch pytorch deleted a comment from pytorchmergebot May 11, 2023
@razarmehr razarmehr requested a review from malfet May 11, 2023 20:10
@kulinseth
Collaborator

> This is a great addition. I have been reading up on the differences between implementing a raw Metal kernel vs. using the MPSGraph API. The current MPSGraph API has a concept of re-using a cached graph to reduce allocation overhead. See here:
>
> `auto cachedGraph = LookUpOrCreateCachedGraph<MPSUnaryCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {`
>
> Can we do something like this if we write a raw Metal kernel?

Thanks @TaiPhamD. For a raw Metal kernel you have the flexibility to cache the PSO (pipeline state object) as you see fit. We cache the MPSGraph because compiling it has CPU overhead, and we can do so since the shapes are known. For Metal, shader compilation caching is already built into the OS, so you get that for free; a sketch of user-side PSO caching follows below.
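
For illustration, a hedged sketch of what such user-side PSO caching could look like (the helper name and error handling are assumptions, not code from this PR):

```
#import <Metal/Metal.h>
#include <string>
#include <unordered_map>

// Unlike an MPSGraph cache, a PSO does not depend on tensor shapes, so a
// simple name-keyed cache suffices: compile the Metal function once and
// reuse the pipeline state on every subsequent call.
static id<MTLComputePipelineState> cachedPipelineState(id<MTLDevice> device,
                                                       id<MTLLibrary> library,
                                                       const std::string& kernelName) {
  static std::unordered_map<std::string, id<MTLComputePipelineState>> cache;
  auto it = cache.find(kernelName);
  if (it != cache.end()) {
    return it->second;
  }
  NSError* error = nil;
  id<MTLFunction> fn = [library newFunctionWithName:
      [NSString stringWithUTF8String:kernelName.c_str()]];
  // Real code should check fn/pso for nil and inspect `error`.
  id<MTLComputePipelineState> pso =
      [device newComputePipelineStateWithFunction:fn error:&error];
  cache[kernelName] = pso;
  return pso;
}
```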

@facebook-github-bot
Contributor

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@razarmehr
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: Meta Internal-Only Changes Check

@malfet
Contributor

malfet commented May 15, 2023

@pytorchbot merge -f "Internal builds are fine"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

jcaip pushed a commit that referenced this pull request May 23, 2023
- This change introduces these APIs to enable developing custom kernels on the MPS Stream:
`torch::mps::get_command_buffer()`
`torch::mps::get_dispatch_queue()`
`torch::mps::commit()`
- Add ObjC test case
Pull Request resolved: #100661
Approved by: https://github.com/kulinseth, https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Jul 23, 2023
I've added an implementation of erfinv using the algorithm from https://github.com/pytorch/pytorch/blob/4154c8ea159fdaecc71ee9af820ac956193c875b/aten/src/ATen/native/Math.h#L152 so that the MPS-based algorithm matches the CPU automatic test. This PR uses the new Metal API calls from #100661.

Testing shows MPS has a decent speedup (~270x) compared to CPU on a tensor of 200 million elements.
```
# NOTE: run in IPython/Jupyter; %timeit is an IPython magic, not plain Python
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")

# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps)  # y_mps is already on the CPU here
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x:  {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
```
CPU torch.erfinv time:  2.654937833400254
MPS torch.erfinv time:  0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv:  tensor(4.2339e-14)
Largest difference
x:  -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06
```

Fixes #86808

Pull Request resolved: #101507
Approved by: https://github.com/kulinseth