[MPS] Register index.Tensor_out #82507

Closed · 10 commits
Conversation

@kulinseth commented Jul 29, 2022

  • Add more tests from test_indexing into test_mps
  • Cache the indexing library on the MPSDevice

@facebook-github-bot commented Jul 29, 2022

✅ No Failures (5 Pending)

As of commit 132ba61 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Aug 1, 2022
@malfet malfet left a comment


Lots of nits, but two major things:

  • _mtl_indexing_library must be set to nil in the MTLDevice constructor
  • Please extract the dispatch logic into a standalone PR


@philipturner commented Aug 4, 2022

I have a question about the design of this PR. Static typing incurs much greater compile-time overhead with minimal improvement in runtime execution speed. It can increase the number of shader objects by a factor of 10 and flood the GPU instruction cache. Have you tried a dynamically typed approach that requires only one Metal shader, and benchmarked its performance?

pytorchmergebot pushed a commit that referenced this pull request Aug 8, 2022
Implement bitwise operators as metal kernels
Dynamically compile metal library for a triplet of input and output tensor types.
Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (this relies on the MPS device being at least `MTLGPUFamilyMac2`, which will be explicitly checked in #82507).

Perf improvements: Add support for non-contiguous tensors and broadcasting

Test Plan:
Already tested in  `test_mps.py`, for example by `TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8`
Pull Request resolved: #82307
Approved by: https://github.com/albanD
facebook-github-bot pushed a commit that referenced this pull request Aug 9, 2022
Summary:
Implement bitwise operators as metal kernels
Dynamically compile metal library for a triplet of input and output tensor types.
Use `dispatchThreads:threadsPerThreadgroup:` to dispatch work (this relies on the MPS device being at least `MTLGPUFamilyMac2`, which will be explicitly checked in #82507).

Perf improvements: Add support for non-contiguous tensors and broadcasting

Pull Request resolved: #82307
Approved by: https://github.com/albanD

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/0377615b6855c7e306669a16e74bbfc01ab86c1c

Test plan from GitHub:
Already tested in  `test_mps.py`, for example by `TestConsistencyCPU.test_output_match_bitwise_xor_cpu_uint8`

Reviewed By: kit1980

Differential Revision: D38505765

Pulled By: malfet

fbshipit-source-id: f1265a4d43f0a0b52af622838c8873b32dffcbfb
@kulinseth commented Aug 15, 2022

@malfet , please take a look at the PR.

@philipturner:

`constant const` on these buffer bindings is redundant. The Metal compiler already describes `constant` as `const constant` in its error messages. The semantic meaning of the `constant` address space is `const device`, with the additional statement that the data will likely fall into the uniform registers. Please remove the `const` in the buffer bindings.

@malfet commented Aug 16, 2022

@kulinseth at the very least this PR needs a rebase (the dispatch code has already landed in #82612) and a fix for the linter; leaving a few more comments right now.

Comment on lines 15 to 24
// MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants)
MTLLanguageVersion languageVersion;

#if defined(__MAC_13_0) && __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_13_0
languageVersion = MTLLanguageVersion3_0;
#elif defined(__MAC_12_0) && __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_12_0
languageVersion = MTLLanguageVersion2_4;
#elif
#error "Metal is not available on the current platform."
#endif
Contributor:

Code as it's currently written will fail to compile on macOS 14.0, when it is eventually released.
And hardcoding the version would also prevent us from encountering runtime errors if newer MTL language standards were to make incompatible changes.
(And the compiler optimizations applied are independent of MTLLanguageVersion, aren't they?)

Suggested change
// MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants)
MTLLanguageVersion languageVersion;
#if defined(__MAC_13_0) && __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_13_0
languageVersion = MTLLanguageVersion3_0;
#elif defined(__MAC_12_0) && __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_12_0
languageVersion = MTLLanguageVersion2_4;
#elif
#error "Metal is not available on the current platform."
#endif
// MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants)
// host_name attribute needs at least Metal 2.2
MTLLanguageVersion languageVersion = MTLLanguageVersion2_2;

@DenisVieriu97 commented Aug 16, 2022

Code as it's currently written will fail to compile on macOS 14.0, when it is eventually released.

This shouldn't fail to compile on newer macOS. macOS Ventura is 13.0, and according to the macro logic it would fall into this branch:

#if defined(__MAC_13_0) && __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_13_0
  languageVersion = MTLLanguageVersion3_0;

For higher versions it would fall into the same branch (the check __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_13_0 tests for the version being >= 13, not strictly equal to 13).
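The branch selection described above can be restated as ordinary runtime logic. This is a hypothetical plain-C++ helper, with integers standing in for the MTLLanguageVersion constants: a `>=` comparison picks the newest branch the deployment target satisfies, so a 14.0 target still compiles and simply takes the 13.0 branch.

```cpp
#include <cassert>

// Hypothetical restatement of the preprocessor branch selection:
// the ">=" comparison on the deployment target selects the newest
// branch it satisfies, so 14 still maps to the 13.0 branch rather
// than falling through to the error case.
int pickLanguageVersion(int min_required_major) {
  if (min_required_major >= 13) return 30;  // stands in for MTLLanguageVersion3_0
  if (min_required_major >= 12) return 24;  // stands in for MTLLanguageVersion2_4
  return -1;                                // Metal unavailable on this platform
}
```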

Collaborator Author:

And hardcoding the version would also prevent us from encountering runtime errors if newer MTL language standards were to make incompatible changes.

These are backwards compatible, and generally we don't even need to update the language version unless we have to use a Metal feature from it. And if we do need some new language feature, we can bump the logic here.

Collaborator Author:

Hardcoding to 2_2 as per the suggestion.

~MPSDevice();

private:
static MPSDevice* _device;
MTLDevice_t _mtl_device;
MTLLibrary_t _mtl_indexing_library;
Contributor:

Why does it need to be a part of MPSDevice rather than implementation detail in Indexing.mm?

Collaborator:

This is cached on the MPSDevice itself the very first time index.Tensor_out is called; all subsequent calls directly use the cached version of the library.
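That first-call caching pattern can be sketched in plain C++ (all names hypothetical; `Library` stands in for an MTLLibrary handle): the member starts out null, the expensive compile happens exactly once, and every later call reuses the cached handle.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an MTLLibrary handle.
struct Library { std::string name; };

class Device {
 public:
  // Returns the cached indexing library, compiling it on first use only.
  Library* indexingLibrary() {
    if (!_indexing_library) {
      ++_compile_count;  // the expensive compile happens only once
      _indexing_library = std::make_unique<Library>(Library{"index_kernels"});
    }
    return _indexing_library.get();
  }
  int compileCount() const { return _compile_count; }

 private:
  std::unique_ptr<Library> _indexing_library;  // starts null, mirroring "set to nil"
  int _compile_count = 0;
};
```

The member starting out null mirrors the earlier review note that `_mtl_indexing_library` must be set to nil in the constructor.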

Collaborator Author:

@malfet, are you worried about adding overhead during device initialization? I was thinking it would be better to load the kernels at device-init time rather than taking the hit when we first need the indexing operation.

Contributor:

No, just from a code clarity point of view: mtl_indexing_library is an implementation detail of the indexing operator and should not leak into MPSDevice. We could implement an RAII mechanism for registering libraries with MPSDevice, but individual implementations IMO do not belong here.
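The RAII registration idea could look roughly like this (all names hypothetical; the library handle is an opaque pointer): a guard object registers a library with a device-owned registry on construction and unregisters it on destruction, so the device type never has to name any individual operator's library.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical device-owned registry of compiled libraries.
struct LibraryRegistry {
  std::map<std::string, const void*> libs;  // name -> opaque library handle
};

// RAII guard: an operator's .mm file would own one of these, keeping
// MPSDevice free of per-operator members like _mtl_indexing_library.
class RegisteredLibrary {
 public:
  RegisteredLibrary(LibraryRegistry& reg, std::string name, const void* handle)
      : reg_(reg), name_(std::move(name)) {
    reg_.libs[name_] = handle;                     // register on construction
  }
  ~RegisteredLibrary() { reg_.libs.erase(name_); } // unregister on destruction

 private:
  LibraryRegistry& reg_;
  std::string name_;
};
```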

Comment on lines +25 to +36
case ScalarType::Float:
res = "float"; break;
case ScalarType::Half:
res = "half"; break;
case ScalarType::Long:
res = "long"; break;
case ScalarType::Int:
res = "int"; break;
case ScalarType::Short:
res = "short"; break;
case ScalarType::Char:
res = "char"; break;
case ScalarType::Byte:
res = "uchar"; break;
case ScalarType::Bool:
res = "bool"; break;
Contributor:

You only need one kernel per type size, don't you?

Suggested change
case ScalarType::Float:
res = "float"; break;
case ScalarType::Half:
res = "half"; break;
case ScalarType::Long:
res = "long"; break;
case ScalarType::Int:
res = "int"; break;
case ScalarType::Short:
res = "short"; break;
case ScalarType::Char:
res = "char"; break;
case ScalarType::Byte:
res = "uchar"; break;
case ScalarType::Bool:
res = "bool"; break;
case ScalarType::Long:
res = "64bit"; break;
case ScalarType::Float:
case ScalarType::Int:
res = "32bit"; break;
case ScalarType::Half:
case ScalarType::Short:
res = "16bit"; break;
case ScalarType::Char:
case ScalarType::Byte:
case ScalarType::Bool:
res = "8bit"; break;

Comment on lines +42 to +114
template
[[host_name("index_select_float")]]
kernel void index_select<float>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_half")]]
kernel void index_select<half>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_long")]]
kernel void index_select<long>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_int")]]
kernel void index_select<int>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_short")]]
kernel void index_select<short>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_char")]]
kernel void index_select<char>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_uchar")]]
kernel void index_select<uchar>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);

template
[[host_name("index_select_bool")]]
kernel void index_select<bool>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
Contributor:

One doesn't need a template per datatype, but rather a template per element size (unless there is a memcpy-like function in Metal, in which case templates aren't needed at all).

Suggested change
template
[[host_name("index_select_float")]]
kernel void index_select<float>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_half")]]
kernel void index_select<half>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_long")]]
kernel void index_select<long>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_int")]]
kernel void index_select<int>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_short")]]
kernel void index_select<short>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_char")]]
kernel void index_select<char>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_uchar")]]
kernel void index_select<uchar>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_bool")]]
kernel void index_select<bool>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_32bit")]]
kernel void index_select<int>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_16bit")]]
kernel void index_select<short>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_64bit")]]
kernel void index_select<long>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);
template
[[host_name("index_select_8bit")]]
kernel void index_select<char>(constant const IndexAB & indexAB [[buffer(0)]],
constant const void * indexSizes [[buffer(1)]],
constant const void * indexStrides [[buffer(2)]],
constant const uint3 * offsets [[buffer(3)]],
constant const void * inputData [[buffer(4)]],
device void * outputData [[buffer(5)]],
uint thread_index [[thread_position_in_grid]]);

@DenisVieriu97 commented Aug 16, 2022

The 64/32/8-bit templates make sense for indexing; I'll update that. Regarding the use of templates, we'd still need them: there is no memcpy in Metal.

Collaborator Author:

@malfet, can we follow this up in the next PR, together with index_put? We didn't want to combine the two, as it would become a huge PR.

Contributor:

Sure, can you please file a followup issue?

@kulinseth:

@kulinseth at the very least this PR needs a rebase (the dispatch code has already landed in #82612) and a fix for the linter.

Rebase and lint issues are fixed.

@kulinseth:

@malfet, unrelated to this change, but `lintrunner -m master` on macOS has become unusable. There are so many warnings:

         76  |  if (str == "torch.Tensor") {

  Error (CLANGTIDY) [cppcoreguidelines-init-variables,-warnings-as-errors]
    variable 'ret' is not initialized

        106  |}
        107  |
        108  |std::vector<std::pair<Backend, ScalarType>> all_declared_types() {
    >>> 109  |  std::vector<std::pair<Backend, ScalarType>> ret;
        110  |
        111  |  // NOTE: Do not add more types here. This list controls the creation
        112  |  // of legacy tensor types e.g. torch.cuda.FloatTensor which are

  Error (CLANGTIDY) [cppcoreguidelines-init-variables,-warnings-as-errors]
    variable 'backends' is not initialized

        111  |  // NOTE: Do not add more types here. This list controls the creation
        112  |  // of legacy tensor types e.g. torch.cuda.FloatTensor which are
        113  |  // maintained for backwards-compatibility only.
    >>> 114  |  std::vector<Backend> backends = {
        115  |      Backend::CPU, Backend::CUDA, Backend::SparseCPU, Backend::SparseCUDA};
        116  |  std::vector<ScalarType> scalar_types = {
        117  |      ScalarType::Byte,

  Error (CLANGTIDY) [cppcoreguidelines-init-variables,-warnings-as-errors]
    variable 'scalar_types' is not initialized

        113  |  // maintained for backwards-compatibility only.
        114  |  std::vector<Backend> backends = {
        115  |      Backend::CPU, Backend::CUDA, Backend::SparseCPU, Backend::SparseCUDA};
    >>> 116  |  std::vector<ScalarType> scalar_types = {
        117  |      ScalarType::Byte,
        118  |      ScalarType::Char,
        119  |      ScalarType::Double,



>>> Lint for torch/csrc/utils/throughput_benchmark.cpp:

  Error (CLANGTIDY) [bugprone-branch-clone,-warnings-as-errors]
    if with identical then and else branches

         16  |
         17  |void ThroughputBenchmark::addInput(py::args args, py::kwargs kwargs) {
         18  |  CHECK(script_module_.initialized() ^ module_.initialized());
    >>>  19  |  if (script_module_.initialized()) {
         20  |    script_module_.addInput(std::move(args), std::move(kwargs));
         21  |  } else {
         22  |    CHECK(module_.initialized());

  Error (CLANGTIDY) [cppcoreguidelines-pro-type-member-init,-warnings-as-errors]
    constructor does not initialize these fields: module_

         39  |  }
         40  |}
         41  |
    >>>  42  |ThroughputBenchmark::ThroughputBenchmark(jit::Module script_module)
         43  |    : script_module_(script_module) {}
         44  |
         45  |ThroughputBenchmark::ThroughputBenchmark(py::object module)

  Error (CLANGTIDY) [cppcoreguidelines-pro-type-member-init,-warnings-as-errors]
    constructor does not initialize these fields: module_

         42  |ThroughputBenchmark::ThroughputBenchmark(jit::Module script_module)
         43  |    : script_module_(script_module) {}
         44  |
    >>>  45  |ThroughputBenchmark::ThroughputBenchmark(py::object module)
         46  |    : module_(std::move(module)) {}
         47  |
         48  |BenchmarkExecutionStats ThroughputBenchmark::benchmark(



>>> Lint for torch/utils/jit/__init__.py:

  Warning (FLAKE8) W391
    blank line at end of file
    See https://www.flake8rules.com/rules/W391.html

    >>> 1  |



>>> Lint for torch/utils/data/dataloader.py:

  Error (MYPY) [attr-defined]
    Module has no attribute "sched_getaffinity"

         541  |        cpuset_checked = False
         542  |        if hasattr(os, 'sched_getaffinity'):
         543  |            try:
    >>>  544  |                max_num_worker_suggest = len(os.sched_getaffinity(0))
         545  |                cpuset_checked = True
         546  |            except Exception:
         547  |                pass

that we are unable to find the relevant lint issues locally.

@kulinseth added the ciflow/trunk (Trigger trunk jobs on your pull request) and ciflow/mps (Run MPS tests, subset of trunk) labels Aug 16, 2022

std::string res = "";
switch (scalar_type) {
case ScalarType::Float:
res = "float"; break;
Contributor:

Nit

Suggested change
res = "float"; break;
return "float";

namespace native {
namespace mps {

std::string getMetalScalarType(ScalarType scalar_type) {
Contributor:

Nit: please return a string literal type (const char*, or alternatively const std::string&) to avoid an unnecessary copy.

Suggested change
std::string getMetalScalarType(ScalarType scalar_type) {
const char* getMetalScalarType(ScalarType scalar_type) {

AT_ASSERT(num_indices == iter.ntensors() - 2);
const Tensor& inputTensor = iter.tensor(1);

TORCH_CHECK(c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true),
Contributor:

This would return false for Float and Half, which I believe is not the intended behavior. (Also, is it covered by tests right now? If not, then IMO it should be.)

Suggested change
TORCH_CHECK(c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true),
TORCH_CHECK(inputTensor.scalar_type() == ScalarType::Float ||
inputTensor.scalar_type() == ScalarType::Half ||
c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true),

Collaborator:

Yes, float and half are covered in the test cases (the latest commit contains this change):

   TORCH_CHECK(c10::isIntegralType(inputTensor.scalar_type(), /*includesBool=*/true) ||
                inputTensor.scalar_type() == ScalarType::Float ||
                inputTensor.scalar_type() == ScalarType::Half,
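The dtype gate settled on in this thread can be restated as a self-contained sketch (the enum here is a local stand-in, not the real c10::ScalarType): the integral family including Bool is accepted, plus Float and Half, and everything else is rejected.

```cpp
#include <cassert>

// Local stand-in for c10::ScalarType, for illustration only.
enum class ScalarType { Float, Half, Double, ComplexFloat,
                        Long, Int, Short, Char, Byte, Bool };

bool isSupportedIndexingType(ScalarType t) {
  switch (t) {
    case ScalarType::Long:
    case ScalarType::Int:
    case ScalarType::Short:
    case ScalarType::Char:
    case ScalarType::Byte:
    case ScalarType::Bool:   // the integral family, includesBool=true
    case ScalarType::Float:
    case ScalarType::Half:   // the two floating types allowed by the check
      return true;
    default:
      return false;          // e.g. Double, ComplexFloat are not handled
  }
}
```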

@kulinseth:

======================================================================
ERROR [0.002s]: test_single_output (__main__.TestAOTAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/functorch/test/test_pythonkey.py", line 223, in test_single_output
    self.verify_aot_autograd(f, inp)
  File "/var/lib/jenkins/workspace/functorch/test/test_pythonkey.py", line 216, in verify_aot_autograd
    self.assertEqual(ref_out, test_out)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2391, in assertEqual
    y = torch.as_tensor(y, dtype=x.dtype, device=x.device)
ValueError: only one element tensors can be converted to Python scalars

This error seems unrelated to this PR.

@kulinseth:

@pytorchbot rebase

@pytorchmergebot:

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot:

Successfully rebased indexing onto refs/remotes/origin/master, please pull locally before adding more changes (for example, via git checkout indexing && git pull --rebase)

@kulinseth:

@pytorchbot merge

@pytorchmergebot:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions:

Hey @kulinseth.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 19, 2022
Summary:
* Add more tests from test_indexing into test_mps
* Cache the indexing library on the MPSDevice

Pull Request resolved: #82507
Approved by: https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ce7177f88a8c76351087bd06520681e60591ff50

Reviewed By: atalman

Differential Revision: D38830978

Pulled By: atalman

fbshipit-source-id: 69eb9e0a5779cf4d0b0d0c492e1ba210d9ae59bf
Labels
ciflow/mps (Run MPS tests, subset of trunk) · ciflow/trunk (Trigger trunk jobs on your pull request) · cla signed · Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)