[MPS] Add native bitwise_[and|or|xor] #82307
Conversation
MPS CI is expected to fail for this one, as it depends on the fix provided by #82315
Force-pushed from c5798c8 to 1b8c1d7
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/82307
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9b53708
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 1b8c1d7 to 995f04d
Only two small nits, but LGTM!
at::Tensor output = output_;
bool needs_output_copy = false;

if (!output_.is_contiguous()) {
Move this after the resize?
If you actually resized, it will have become contiguous for free, IIRC.
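A minimal sketch of the suggested reordering, assuming the op uses ATen's `resize_output` helper (the exact call in the PR may differ):

```cpp
#include <ATen/native/Resize.h>

at::Tensor output = output_;
bool needs_output_copy = false;

// Resize first: a tensor that was actually reallocated by resize_output
// comes back contiguous, so the copy path below is skipped for free.
at::native::resize_output(output_, self.sizes());
if (!output_.is_contiguous()) {
  output = output_.contiguous();
  needs_output_copy = true;
}
```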
Force-pushed from 995f04d to 6b02272
@pytorchbot merge -f "Run MPS is green"
@pytorchbot successfully started a merge job. Check the current status here
Hey @malfet.
[commandEncoder setBuffer:outBuf offset:output.storage_offset() * output.itemsize() atIndex:1];
[commandEncoder setBuffer:selfBuf offset:self.storage_offset() * self.itemsize() atIndex:2];
[commandEncoder setBytes:&sval length:sizeof(sval) atIndex:3];
[commandEncoder dispatchThreadgroups:MTLSizeMake((length + 511) / 512, 1, 1)
Should be `dispatch1DJob` as well?
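For reference, a helper along the lines of the `dispatch1DJob` the review mentions might look like the sketch below; the actual name and signature in the MPS codebase may differ, so treat this as illustrative Objective-C++ rather than the PR's code:

```objc
#import <Metal/Metal.h>
#include <algorithm>

// Hedged sketch: dispatch a 1-D grid of `length` threads, letting Metal pick
// the threadgroup split instead of hard-coding 512 threads per group.
// dispatchThreads: handles lengths that are not a multiple of the group size,
// but requires MTLGPUFamilyMac2 or newer.
static inline void dispatch1DJob(id<MTLComputeCommandEncoder> encoder,
                                 id<MTLComputePipelineState> pso,
                                 NSUInteger length) {
  const NSUInteger maxThreads = pso.maxTotalThreadsPerThreadgroup;
  [encoder dispatchThreads:MTLSizeMake(length, 1, 1)
     threadsPerThreadgroup:MTLSizeMake(std::min(maxThreads, length), 1, 1)];
}
```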
TORCH_LIBRARY_IMPL(aten, MPS, m) {
Shouldn't these be in the `native_functions.yaml` file?
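For context, registering the MPS kernel through `native_functions.yaml` (instead of a `TORCH_LIBRARY_IMPL` block) would look roughly like the sketch below; the MPS kernel name here is a hypothetical placeholder, not necessarily what the PR uses:

```yaml
- func: bitwise_and.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
  dispatch:
    CPU, CUDA: bitwise_and_out
    MPS: bitwise_and_out_mps   # hypothetical kernel name for illustration
```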
@@ -403,5 +403,6 @@ void add_sub_template(const Tensor& self, const Tensor& other, const Scalar& alp
  runMPSGraph(stream, cachedGraph->graph(), feeds, results);
 }
}
Spurious line.
Summary:
Implement bitwise operators as Metal kernels.

Dynamically compile a Metal library for each triplet of input and output tensor types. Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (this relies on the fact that the MPS device is at least `MTLGPUFamilyMac2`, which will be explicitly checked in #82507).

Perf improvements: add support for non-contiguous tensors and broadcasting.

Pull Request resolved: #82307
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/0377615b6855c7e306669a16e74bbfc01ab86c1c

Test plan from GitHub: already tested in `test_mps.py`, for example by `TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8`

Reviewed By: kit1980

Differential Revision: D38505765

Pulled By: malfet

fbshipit-source-id: f1265a4d43f0a0b52af622838c8873b32dffcbfb
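For illustration, compiling such a library at runtime follows the standard Metal API pattern referenced in the summary; the helper below is a hedged sketch (the `makePipeline` name and error handling are invented here, not the PR's code):

```objc
#import <Metal/Metal.h>

// Sketch: compile a Metal library from source at runtime and build a compute
// pipeline for one kernel. The PR generates the source per type triplet;
// here the caller is assumed to pass the already-specialized source string.
static id<MTLComputePipelineState> makePipeline(id<MTLDevice> device,
                                                NSString* source,
                                                NSString* kernelName) {
  NSError* error = nil;
  id<MTLLibrary> lib = [device newLibraryWithSource:source
                                            options:nil
                                              error:&error];
  NSCAssert(lib, @"Failed to compile Metal library: %@", error);
  id<MTLFunction> fn = [lib newFunctionWithName:kernelName];
  NSCAssert(fn, @"Missing kernel function %@", kernelName);
  id<MTLComputePipelineState> pso =
      [device newComputePipelineStateWithFunction:fn error:&error];
  NSCAssert(pso, @"Failed to create pipeline state: %@", error);
  return pso;
}
```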
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in #82307). Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see #125550). But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`.

This PR does exactly that by:
- Moving shader sources into separate .metal files
- Adding a check on whether full Xcode is installed, or just DeveloperTools
- If full Xcode is installed, compiling and linking shaders into a .metallib for the Metal-3.0 standard (available on macOS 13) and the Metal-3.1 standard (available on macOS 14; can use bfloat), and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. A `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. Logic for generating the metal libraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping .metal into a `_metallib.h` that contains the shader source wrapped in a `MetalShaderLibrary`

The bulk of the changes introduced in this PR is just moving code around. I.e. for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following:

```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:

| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better-optimized compiled kernels

Cons:
- The build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: #138636
Approved by: https://github.com/manuelcandales
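For illustration, reading back a section embedded with `-sectcreate` uses the `getsectiondata` call mentioned above. The sketch below is a hedged example: the segment/section names and the use of `dladdr` to locate the containing image are assumptions, not necessarily what the PR does:

```cpp
#include <mach-o/getsect.h>

#include <dlfcn.h>
#include <cstdint>
#include <utility>

// Sketch: locate the metallib bytes bundled into this binary via, e.g.,
// `-sectcreate __TEXT metal_lib shaders.metallib` (names assumed here).
static std::pair<const uint8_t*, unsigned long> bundledMetallib() {
  Dl_info info;
  // Find the mach header of the image containing this function
  // (e.g. libtorch_cpu.dylib), since -sectcreate applies to that image.
  if (dladdr(reinterpret_cast<void*>(&bundledMetallib), &info) == 0) {
    return {nullptr, 0};
  }
  unsigned long size = 0;
  auto* data = getsectiondata(
      static_cast<const mach_header_64*>(info.dli_fbase),
      "__TEXT", "metal_lib", &size);
  return {data, size};
}
```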