Skip to content

[TIR] Add cooperative_tensor builtins and metal.cooperative_tensor storage scope#34

Open
oraluben wants to merge 4 commits intotile-ai:tilelang_mainfrom
oraluben:metal-cooperative-tensor
Open

[TIR] Add cooperative_tensor builtins and metal.cooperative_tensor storage scope#34
oraluben wants to merge 4 commits intotile-ai:tilelang_mainfrom
oraluben:metal-cooperative-tensor

Conversation

@oraluben
Copy link
Copy Markdown

Summary

  • Add 4 TIR builtins for Metal cooperative_tensor operations (MetalPerformancePrimitives / Metal 4):
    • cooperative_tensor_fill
    • cooperative_tensor_load
    • cooperative_tensor_store
    • cooperative_tensor_multiply_accumulate
  • Add metal.cooperative_tensor storage scope (StorageRank::kMetalCooperativeTensor)

Motivation

MetalPerformancePrimitives (MPP) provides matmul2d with cooperative_tensor operands that route to NAX tensor cores on Apple M5 and fall back to simdgroup matrix on M1-M4. These TIR builtins enable Metal backend codegen to emit MPP calls, analogous to the existing simdgroup_* builtins for the older Metal simdgroup matrix API.

Changes

  • include/tvm/tir/builtin.h — 4 new Op declarations
  • src/tir/op/builtin.cc — 4 new Op registrations
  • python/tvm/tir/op.py — Python wrapper functions
  • python/tvm/script/ir_builder/tir/ir.py — Script parser exports + __all__
  • src/runtime/thread_storage_scope.hkMetalCooperativeTensor StorageRank + scope string parsing

Companion tilelang PR uses these builtins for Metal GEMM codegen targeting MPP matmul2d.

@oraluben oraluben force-pushed the metal-cooperative-tensor branch 2 times, most recently from d1bc7ee to afe168b Compare April 20, 2026 07:18
…orage scope

Add TIR builtins for Metal cooperative_tensor operations (MetalPerformancePrimitives):
- cooperative_tensor_fill: fill a cooperative_tensor with a value
- cooperative_tensor_load: load from device/threadgroup memory
- cooperative_tensor_store: store to device/threadgroup memory
- cooperative_tensor_multiply_accumulate: matrix multiply-accumulate via matmul2d

Add metal.cooperative_tensor storage scope (StorageRank::kMetalCooperativeTensor)
for buffers backed by MPP cooperative_tensor registers, analogous to the existing
metal.simdgroup scope but targeting the Metal 4 tensor operations API.

These primitives enable code generation for MetalPerformancePrimitives matmul2d,
which routes to NAX tensor cores on Apple M5 and falls back to simdgroup matrix
instructions on M1-M4.
MTLLanguageVersion4_0 is only available in macOS 26+ SDK. Fall back
to 3_1 (macOS 14+) or 3_0 for older SDKs to fix CI builds.
@oraluben oraluben force-pushed the metal-cooperative-tensor branch from 2afb490 to 29e5bc8 Compare April 26, 2026 03:21
@oraluben
Copy link
Copy Markdown
Author

oraluben commented Apr 26, 2026

Downstream PR: tile-ai/tilelang#1869 ([Metal] Add Metal GEMM support with cooperative_tensor MMA)

This PR adds the TIR builtins (cooperative_tensor_fill, cooperative_tensor_load, cooperative_tensor_store, cooperative_tensor_multiply_accumulate) and metal.cooperative_tensor storage scope that tile-ai/tilelang#1869 depends on.

Regarding testing: the TIR builtins are exercised end-to-end through tilelang's Metal GEMM tests (18 tests covering correctness across various tile configurations). We're not sure what additional TVM-level tests would be appropriate for these builtins — if there's a specific test pattern you'd like us to add (e.g. TIR script roundtrip tests, or builtin registration checks), we're happy to do so.

@oraluben
Copy link
Copy Markdown
Author

Upstream PR: apache#19423 (same change targeting apache/tvm main)

The cython backend uses strict __slots__ on ObjectBase, which
prevents setting _inst on PyFunctionPass without declaring it.
This works with ctypes backend (which has __dict__) but fails
on CI with cython.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant