[MPS] Add native bitwise_[and|or|xor] #82307
Conversation
MPS CI is expected to fail for this one, as it depends on the fix provided by #82315
Force-pushed from c5798c8 to 1b8c1d7
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/82307
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9b53708
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 1b8c1d7 to 995f04d
Only two small nits, but LGTM!
at::Tensor output = output_;
bool needs_output_copy = false;

if (!output_.is_contiguous()) {
Move this after the resize?
If you actually resized, it will have become contiguous for free, IIRC.
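A minimal sketch of the suggested reordering, assuming the op uses ATen's `resize_output` helper (the exact call in the PR may differ):

```cpp
#include <ATen/native/Resize.h>

at::Tensor output = output_;
bool needs_output_copy = false;

// Resize first: a tensor that was actually reallocated by resize_output
// comes back contiguous, so the copy path below is skipped for free.
at::native::resize_output(output_, self.sizes());
if (!output_.is_contiguous()) {
  output = output_.contiguous();
  needs_output_copy = true;
}
```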
Force-pushed from 995f04d to 6b02272
@pytorchbot merge -f "Run MPS is green"
@pytorchbot successfully started a merge job. Check the current status here
Hey @malfet.
[commandEncoder setBuffer:outBuf offset:output.storage_offset() * output.itemsize() atIndex:1];
[commandEncoder setBuffer:selfBuf offset:self.storage_offset() * self.itemsize() atIndex:2];
[commandEncoder setBytes:&sval length:sizeof(sval) atIndex:3];
[commandEncoder dispatchThreadgroups:MTLSizeMake((length + 511) / 512, 1, 1)
Should be `dispatch1DJob` as well?
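For reference, a helper along the lines of the `dispatch1DJob` the review mentions might look like the sketch below; the actual name and signature in the MPS codebase may differ, so treat this as illustrative Objective-C++ rather than the PR's code:

```objc
#import <Metal/Metal.h>
#include <algorithm>

// Hedged sketch: dispatch a 1-D grid of `length` threads, letting Metal pick
// the threadgroup split instead of hard-coding 512 threads per group.
// dispatchThreads: handles lengths that are not a multiple of the group size,
// but requires MTLGPUFamilyMac2 or newer.
static inline void dispatch1DJob(id<MTLComputeCommandEncoder> encoder,
                                 id<MTLComputePipelineState> pso,
                                 NSUInteger length) {
  const NSUInteger maxThreads = pso.maxTotalThreadsPerThreadgroup;
  [encoder dispatchThreads:MTLSizeMake(length, 1, 1)
     threadsPerThreadgroup:MTLSizeMake(std::min(maxThreads, length), 1, 1)];
}
```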
TORCH_LIBRARY_IMPL(aten, MPS, m) {
Shouldn't these be in the `native_functions.yaml` file?
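For context, registering the MPS kernel through `native_functions.yaml` (instead of a `TORCH_LIBRARY_IMPL` block) would look roughly like the sketch below; the MPS kernel name here is a hypothetical placeholder, not necessarily what the PR uses:

```yaml
- func: bitwise_and.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
  dispatch:
    CPU, CUDA: bitwise_and_out
    MPS: bitwise_and_out_mps   # hypothetical kernel name for illustration
```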
@@ -403,5 +403,6 @@ void add_sub_template(const Tensor& self, const Tensor& other, const Scalar& alp
  runMPSGraph(stream, cachedGraph->graph(), feeds, results);
 }
}
Spurious line.
Summary:
Implement bitwise operators as Metal kernels.

Dynamically compile a Metal library for each triplet of input and output tensor types. Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (this relies on the fact that the MPS device is at least `MTLGPUFamilyMac2`, which will be explicitly checked in #82507).

Perf improvements: add support for non-contiguous tensors and broadcasting.

Pull Request resolved: #82307
Approved by: https://github.com/albanD

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/0377615b6855c7e306669a16e74bbfc01ab86c1c

Test plan from GitHub: already tested in `test_mps.py`, for example by `TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8`

Reviewed By: kit1980

Differential Revision: D38505765

Pulled By: malfet

fbshipit-source-id: f1265a4d43f0a0b52af622838c8873b32dffcbfb
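For illustration, compiling such a library at runtime follows the standard Metal API pattern referenced in the summary; the helper below is a hedged sketch (the `makePipeline` name and error handling are invented here, not the PR's code):

```objc
#import <Metal/Metal.h>

// Sketch: compile a Metal library from source at runtime and build a compute
// pipeline for one kernel. The PR generates the source per type triplet;
// here the caller is assumed to pass the already-specialized source string.
static id<MTLComputePipelineState> makePipeline(id<MTLDevice> device,
                                                NSString* source,
                                                NSString* kernelName) {
  NSError* error = nil;
  id<MTLLibrary> lib = [device newLibraryWithSource:source
                                            options:nil
                                              error:&error];
  NSCAssert(lib, @"Failed to compile Metal library: %@", error);
  id<MTLFunction> fn = [lib newFunctionWithName:kernelName];
  NSCAssert(fn, @"Missing kernel function %@", kernelName);
  id<MTLComputePipelineState> pso =
      [device newComputePipelineStateWithFunction:fn error:&error];
  NSCAssert(pso, @"Failed to create pipeline state: %@", error);
  return pso;
}
```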
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in #82307). Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see #125550). But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`.

This PR does exactly that by:
- Moving shader sources into separate .metal files
- Adding a check on whether full Xcode is installed, or just DeveloperTools
- If full Xcode is installed, compiling and linking shaders into a .metallib for the Metal-3.0 standard (available on macOS 13) and the Metal-3.1 standard (available on macOS 14; can use bfloat), and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. A `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. Logic for generating the metal libraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping .metal into a `_metallib.h` that contains the shader source wrapped in a `MetalShaderLibrary`

The bulk of the changes introduced in this PR is just moving code around. I.e. for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following:

```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:

| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better-optimized compiled kernels

Cons:
- The build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: #138636
Approved by: https://github.com/manuelcandales
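For illustration, reading back a section embedded with `-sectcreate` uses the `getsectiondata` call mentioned above. The sketch below is a hedged example: the segment/section names and the use of `dladdr` to locate the containing image are assumptions, not necessarily what the PR does:

```cpp
#include <mach-o/getsect.h>

#include <dlfcn.h>
#include <cstdint>
#include <utility>

// Sketch: locate the metallib bytes bundled into this binary via, e.g.,
// `-sectcreate __TEXT metal_lib shaders.metallib` (names assumed here).
static std::pair<const uint8_t*, unsigned long> bundledMetallib() {
  Dl_info info;
  // Find the mach header of the image containing this function
  // (e.g. libtorch_cpu.dylib), since -sectcreate applies to that image.
  if (dladdr(reinterpret_cast<void*>(&bundledMetallib), &info) == 0) {
    return {nullptr, 0};
  }
  unsigned long size = 0;
  auto* data = getsectiondata(
      static_cast<const mach_header_64*>(info.dli_fbase),
      "__TEXT", "metal_lib", &size);
  return {data, size};
}
```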