Conversation

malfet
Contributor

@malfet malfet commented Jul 27, 2022

Implement bitwise operators as metal kernels
Dynamically compile metal library for a triplet of input and output tensor types.
Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (relies on the fact that the MPS device is at least `MTLGPUFamilyMac2`, which will be explicitly checked in #82507).

Perf improvements: Add support for non-contiguous tensors and broadcasting

Test Plan:
Already tested in test_mps.py, for example by TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8

@malfet malfet requested a review from albanD July 27, 2022 11:52
@malfet malfet requested a review from kulinseth as a code owner July 27, 2022 11:52
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 27, 2022
@malfet
Contributor Author

malfet commented Jul 27, 2022

MPS CI is expected to fail for this one, as it depends on the fix provided by #82315.

@malfet malfet force-pushed the malfet/mps-add-bitwise-ops branch from c5798c8 to 1b8c1d7 Compare July 28, 2022 03:57
@pytorch-bot

pytorch-bot bot commented Jul 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/82307

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9b53708:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kulinseth
Collaborator

@saharshoza

@malfet malfet force-pushed the malfet/mps-add-bitwise-ops branch from 1b8c1d7 to 995f04d Compare August 3, 2022 18:29
Collaborator

@albanD albanD left a comment


Only two small nits, but LGTM!

at::Tensor output = output_;
bool needs_output_copy = false;

if (!output_.is_contiguous()) {
Collaborator


Move this after the resize? If you actually resized, the output will have become contiguous for free, IIRC.

@malfet malfet force-pushed the malfet/mps-add-bitwise-ops branch from 995f04d to 6b02272 Compare August 7, 2022 23:08
@malfet
Contributor Author

malfet commented Aug 8, 2022

@pytorchbot merge -f "Run MPS is green"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

github-actions bot commented Aug 8, 2022

Hey @malfet.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

[commandEncoder setBuffer:outBuf offset:output.storage_offset()*output.itemsize() atIndex:1];
[commandEncoder setBuffer:selfBuf offset:self.storage_offset()*self.itemsize() atIndex:2];
[commandEncoder setBytes:&sval length:sizeof(sval) atIndex:3];
[commandEncoder dispatchThreadgroups:MTLSizeMake((length + 511) / 512, 1, 1)
Collaborator


Should be dispatch1DJob as well?

@malfet malfet deleted the malfet/mps-add-bitwise-ops branch August 8, 2022 15:55



TORCH_LIBRARY_IMPL(aten, MPS, m) {
Collaborator


Shouldn't these be in the native_functions.yaml file?

@@ -403,5 +403,6 @@ void add_sub_template(const Tensor& self, const Tensor& other, const Scalar& alp
runMPSGraph(stream, cachedGraph->graph(), feeds, results);
}
}

Collaborator


Spurious line.

facebook-github-bot pushed a commit that referenced this pull request Aug 9, 2022
Summary:
Implement bitwise operators as metal kernels
Dynamically compile metal library for a triplet of input and output tensor types.
Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (relies on the fact that the MPS device is at least `MTLGPUFamilyMac2`, which will be explicitly checked in #82507).

Perf improvements: Add support for non-contiguous tensors and broadcasting

Pull Request resolved: #82307
Approved by: https://github.com/albanD

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/0377615b6855c7e306669a16e74bbfc01ab86c1c

Test plan from GitHub:
Already tested in  `test_mps.py`, for example by `TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8`

Reviewed By: kit1980

Differential Revision: D38505765

Pulled By: malfet

fbshipit-source-id: f1265a4d43f0a0b52af622838c8873b32dffcbfb
pytorchmergebot pushed a commit that referenced this pull request Nov 1, 2024
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in #82307).
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see #125550).

But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`.

This PR does exactly that by:
 - Moving shader sources into separate `.metal` files
 - Adding a check on whether full Xcode is installed or just the DeveloperTools CLI
 - If full Xcode is installed, compiling and linking shaders into a `.metallib` for the Metal-3.0 standard (available on macOS 13) and the Metal-3.1 standard (available on macOS 14, can use bfloat), and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call. A `metallib_dummy.cpp` file is used to properly express the dependency between the metallib build and torch_cpu link stages. The logic for generating metallibs is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only the DeveloperTools CLI is installed, automatically wrapping each `.metal` file into a `_metallib.h` header that contains the shader source wrapped in a `MetalShaderLibrary`

The bulk of the changes introduced in this PR is just moving code around. I.e., for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following:
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and index.out |
| 2.0  | 4  | cross, repeat and view |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated than it used to be
  - Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: #138636
Approved by: https://github.com/manuelcandales
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024