
[Metal] FP8 vector cast lanes 2/3/4 (extends storage-only FP8) #39

Open
apstenku123 wants to merge 405 commits into tile-ai:main from apstenku123:cppmega/metal-fp8-vector-cast

Conversation

@apstenku123

Summary

Extends the storage-only Metal FP8 codegen to handle vectorized casts at IR
lanes 2 / 3 / 4. Without this change, every T.Cast("float16x4", fp8_x4)
emitted by upstream TileLang DSL programs raises
LOG(FATAL): Vector FP8 casts (lanes=4) are not yet supported, forcing
callers to manually scalarise the cast, which gives up the IR-level vector
type for every subsequent pass (vectorize, fragment-to-simdgroup, etc.).

This PR is the TileLang/tvm submodule half: it mirrors the change into the
vendored TVM codegen at src/target/source/codegen_metal.{cc,h}. The
companion TileLang supermodule half is filed at
https://github.com/tile-ai/tilelang/pulls (search
cppmega/metal-fp8-vector-cast). The two halves share only helper names;
they can land independently, but they should be merged in tandem so the
TileLang supermodule and its vendored 3rdparty/tvm checkout stay in sync
on this codepath.

What this changes

Adds an enable_fp8_vector_ codegen flag and a new
PrintFP8VectorPrelude(...) that emits inline MSL helpers wrapping the
existing scalar helpers (__tvm_fp8_e4m3_to_half, etc.), one call per lane:

inline half4 __tvm_fp8_e4m3_to_half_v4(uchar4 x) {
  return half4(__tvm_fp8_e4m3_to_half(x.x), __tvm_fp8_e4m3_to_half(x.y),
               __tvm_fp8_e4m3_to_half(x.z), __tvm_fp8_e4m3_to_half(x.w));
}

Mirrors are emitted for _v2 / _v3, plus the reverse direction
(half -> fp8) and the e5m2 variant. The compiler is free to scalarise
these back into per-lane calls; the goal is to preserve the IR-level
vector type so subsequent passes keep their vector loads and stores,
and so the downstream MSL is uchar4-typed instead of using uchar arrays.
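
For reference, the reverse-direction lane-4 wrapper can be sketched the
same way (exact signature assumed; __tvm_half_to_fp8_e4m3 is the scalar
quantize helper named by the storage-only prereq):

inline uchar4 __tvm_half_to_fp8_e4m3_v4(half4 x) {
  // Per-lane quantize-on-store; the _v2 / _v3 and e5m2 mirrors follow
  // the same pattern.
  return uchar4(__tvm_half_to_fp8_e4m3(x.x), __tvm_half_to_fp8_e4m3(x.y),
                __tvm_half_to_fp8_e4m3(x.z), __tvm_half_to_fp8_e4m3(x.w));
}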

Finish() is updated to splice the vector prelude after the scalar
prelude when at least one vector FP8 cast is encountered during codegen.
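
For orientation, a minimal sketch of what that splice can look like,
assuming TVM's usual CodeGenC decl_stream and that both preludes print
into an output stream; the flag and helper names come from this PR and
its prereq, everything else is illustrative:

std::string CodeGenMetal::Finish() {
  // Scalar helpers first (storage-only prereq), then the vector wrappers,
  // each emitted only if the corresponding cast was seen during codegen.
  if (enable_fp8_) {
    PrintFP8Prelude(decl_stream);
  }
  if (enable_fp8_vector_) {
    PrintFP8VectorPrelude(decl_stream);
  }
  return CodeGenC::Finish();
}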

Wider lanes (8 / 16) keep the existing LOG(FATAL) with a sharper
message — those widths print as uint2 / uint4 packed storage and
need an out-pointer ABI to be wired through; callers should lower them
to scalar casts upstream.
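
A hypothetical sketch of the lane dispatch; only the helper names and
the lanes=8/16 FATAL are described by this PR, the surrounding visitor
code and exact message are assumptions:

switch (lanes) {
  case 2: os << "__tvm_fp8_e4m3_to_half_v2"; break;
  case 3: os << "__tvm_fp8_e4m3_to_half_v3"; break;
  case 4: os << "__tvm_fp8_e4m3_to_half_v4"; break;
  default:
    LOG(FATAL) << "FP8 casts with lanes=" << lanes
               << " print as packed uint storage and need an out-pointer "
               << "ABI; lower them to scalar casts upstream.";
}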

Why Apple Silicon needs software FP8 emulation

Apple Silicon (M1 through M4 Max, including the M5 NAX, which is
FP16/INT8 only) has no native FP8 ALU. FP8 is realised as uchar
storage with explicit dequantize-on-load / quantize-on-store; the
encoding mirrors the OCP "OFP8 Formats for Deep Learning" v1.0 spec
(E4M3 finite-only, E5M2 IEEE-style with NaN/Inf).
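
The decode helpers themselves live in the prereq PR, so the following is
only a bitwise reference for the E4M3 decode the vector wrappers delegate
to, pinning down the OCP encoding (bias 7, finite-only, S.1111.111
reserved for NaN); the vendored helper may be LUT-based per the
storage-only PR's title, and the _ref name is hypothetical:

#include <metal_stdlib>
using namespace metal;

// OCP OFP8 E4M3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits;
// finite-only, so S.1111.111 is NaN and there is no Inf.
inline half fp8_e4m3_to_half_ref(uchar v) {
  ushort exp = (v >> 3) & 0xF;
  ushort man = v & 0x7;
  if (exp == 0xF && man == 0x7) {
    return as_type<half>(ushort(((v & 0x80) << 8) | 0x7E00));  // NaN
  }
  half mag;
  if (exp == 0) {
    mag = half(man) / 512.0h;  // subnormal: man * 2^-3 * 2^-6
  } else {
    // normal: rebias into half (bias 15), so exponent field = exp + 8
    mag = as_type<half>(ushort(((exp + 8) << 10) | (man << 7)));
  }
  return (v & 0x80) ? -mag : mag;
}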

The vector helpers in this PR are inline-trivial wrappers around the
scalar helpers that landed in the storage-only PR — no new conversion
math. Their value is purely codegen: the IR-level vector type is
preserved so the rest of the lowering pipeline can vectorise.

Path C consumer evidence (vector lanes matter)

The downstream cppmega.mlx project's Path C TileLang FP8 vecmat kernel
(cppmega_mlx/nn/_tilelang/fp8_vecmat_path_c.py) explicitly uses
T.alloc_local((4,), "float8_e4m3") and a T.vectorized(4) inner loop
over packed FP8 weights. Without this PR, that kernel cannot be lowered
on Metal — the FP8 cast inside the K-loop hits the lanes=4 FATAL.
With this PR, the cast lowers and the resulting MSL preserves
uchar4-typed loads through the K-loop.
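
To make the payoff concrete, a hypothetical shape of that inner loop
after lowering with this PR, as an excerpt from a kernel body (W4, x4,
and acc are illustrative names, not taken from the generated kernel):

// W4: device const uchar4* view of the packed FP8 weights,
// x4: device const half4* view of the activations.
float acc = 0.0f;
for (uint k4 = 0; k4 < K / 4; ++k4) {
  half4 w = __tvm_fp8_e4m3_to_half_v4(W4[k4]);  // vector dequantize-on-load
  acc += dot(float4(w), float4(x4[k4]));        // accumulate in fp32
}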

Dependency

This PR stacks on two prereqs:

  1. The tilelang_metal_fp8 storage-only patch (the parallel
    "[Metal] FP8 storage-only emulation (uchar storage + LUT decode helpers)"
    PR filed against this same repo). That patch adds
    PrintFP8Prelude, enable_fp8_, and the scalar
    __tvm_fp8_*_to_half / __tvm_half_to_fp8_* helpers that the
    vector helpers in this PR call. Reviewers will need that patch
    applied first; the branch in this PR includes it as its first commit
    ("[Metal] FP8 storage-only emulation ... [prereq]") for self-contained
    review.
  2. Base commit 0e15b274bce8b46f971abf5ac390e844aa6acee5 (the
    submodule pin used by tile-ai/tilelang's
    metal-gemm-upstream-rebase branch).

When the storage-only PR merges, the prereq commit on this branch
should be rebased away. Until then, this branch is reviewable as two
stacked commits.

Test plan

  • git apply --check clean against base
    0e15b274bce8b46f971abf5ac390e844aa6acee5 with the storage-only
    prereq applied first
  • git apply --reverse --check clean for both commits in sequence
    (round-trip verified)
  • any prim_func with a vector FP8 cast (lanes 2/3/4) lowers to MSL
    that uses the new vector helpers rather than the scalar fallback,
    and the result compiles cleanly with xcrun --sdk macosx metal -c
  • Direct probe /tmp/test_fp8_vector_cast.py: lanes=4 cast lowers
    and the resulting MSL contains __tvm_fp8_e4m3_to_half_v4 with
    uchar4 typed loads
  • cppmega.mlx tilelang test suite: 134 passed, 0 regressions

LeiWang1999 and others added 30 commits October 22, 2025 21:40
…em raised in the codegen test for cuda (apache#18398)

* fix the 8-bit vector loads/stores so each lane is addressed using reinterpret_cast byte indexing, instead of rolled bit packing, which would omit certain bits.

* fix clang format
…ry ops tests (apache#18400)

* finish1

* finish2

* finish3

* finish4
This PR addresses the issue where tvm.tir.exp does not support integer types (e.g., int32, int64), causing an InternalError during LLVM code generation.

The issue arises because the llvm.exp intrinsic expects floating-point inputs, but no type conversion is performed for integer inputs.

This change aligns the behavior of tir.exp with libraries like PyTorch and NumPy, which implicitly convert integer inputs to floating-point types for their exponential functions.

Fix apache#18381
* Fixing database bug

* Fix lit gemini error
This PR bumps tvm-ffi to latest
…es (apache#18412)

* Replace relax.build with tvm.compile in export script

* Remove unnecessary print statement in export script

Remove print statement for skipping model conversion.

* Update output handling for TVM results
… importing ONNX model using Relax frontend (apache#18416)

[apache#18397] Fix bug: Unsupported numpy or ml_dtypes dtype('O') when importing ONNX model using Relax frontend

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
…t cases (apache#18419)

[apache#17640] Refactor: remove the deprecation warning from test cases

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
…deployment workflow (apache#18413)

This PR modernizes the cross-compilation and RPC tutorial by
adding a complete PyTorch/Relax deployment workflow alongside
the existing TE examples.
kurisu6912 and others added 27 commits January 16, 2026 15:08
- Added PrintIndent call in PrintSSAAssign to improve code formatting.
- Removed unnecessary scope management in VisitExpr_ for better clarity and performance.
- Increased the timeout limit in SetRLimit from 10,000 to 100,000 for improved performance.
- Added detailed logging in the CanProve method to trace the evaluation process and results of the Z3 solver.
- Reduced the timeout limit in SetRLimit from 100,000 to 10,000 for improved control over execution time.
- Fixed formatting inconsistencies in comments and code for better readability.
- Improved the handling of nested conditions in the if_then_else construct to prevent out-of-bounds access by combining outer select conditions.
- Added a stack to manage select conditions during code generation, ensuring proper evaluation order and safety.
- Updated comments for clarity and better understanding of the changes made.
…ation

- Introduced CountSatisfyingValues method in both Z3Prover implementations to count distinct integer values satisfying current constraints using Z3's model enumeration.
- Added detailed documentation for the new method, explaining parameters and return values.
- Implemented basic error handling for unsatisfiable conditions and minimum consecutive value requirements.
- Updated the Z3Prover interface to include the new method, ensuring compatibility with existing functionality.
- Updated the logic for handling read and write access in the VisitExpr_ method to treat read and write masks more conservatively.
- This change simplifies the access region detection process, allowing for better handling of common patterns like atomic read-modify-write without requiring manual annotations from users.
- Improved code clarity by restructuring conditional checks for read and write access updates.
- Added logic to eliminate bounded offsets in comparisons involving expressions of the form (base + offset) when offset is known to be within a specific range.
- Implemented helper functions to determine if expressions are multiples of a given factor and to simplify comparisons based on modular analysis.
- Updated tests to cover new simplification cases for aligned values, ensuring correctness of the new logic.
…ctory handling

- Introduced a new helper function `_resolve_artifact_paths` in `nvcc.py` to streamline the management of temporary file paths for CUDA compilation.
- Enhanced the `TempDirectory` class in `utils.py` to ensure thread-safe creation of debug temporary directories, preventing race conditions in multi-process scenarios.
- Updated tests in `test_util.py` to validate the new debug directory handling and ensure robustness in concurrent environments.
- Introduced EnsureCurrentDeviceContext to ensure the correct CUDA context is set for the current thread before executing device-specific operations.
- Updated multiple methods in CUDAModuleNode and CUDAWrappedFunc to call this new function, enhancing thread safety and context management during multi-GPU execution.
* feat: add bfloat16x2 types

* fix: make less diff
- Added a check to validate that grid dimensions are positive before launching CUDA kernels, improving error handling for dynamic shapes that may result in zero dimensions.
- Simplified work size assignment in thread storage scope to remove unnecessary checks for dynamic shapes.
…nd (apache#30)

cuLaunchKernel is asynchronous and its return value does not capture
runtime errors such as illegal memory access. Add cudaPeekAtLastError()
after the launch to detect these errors, matching the Cython backend's
TILELANG_CHECK_LAST_ERROR behavior.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
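
A sketch of the pattern that commit describes (cuLaunchKernel and
cudaPeekAtLastError are real CUDA APIs; the argument names and the
ICHECK-style assertions around them are illustrative):

// cuLaunchKernel only reports launch-queueing failures; asynchronous
// faults such as illegal memory access surface later and must be
// peeked at explicitly after the launch.
CUresult res = cuLaunchKernel(func, gx, gy, gz, bx, by, bz, smem_bytes,
                              stream, kernel_args, nullptr);
ICHECK_EQ(res, CUDA_SUCCESS);
cudaError_t err = cudaPeekAtLastError();
ICHECK_EQ(err, cudaSuccess) << cudaGetErrorString(err);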
* Add tfloat32 datatype

* fix: change tfloat32 type code to 130

* minor fix
* Fix TVMDerivedObject slots for apache-tvm-ffi compatibility

Add __slots__ = ("_inst", "__weakref__") to the dynamically created
TVMDerivedObject class inside the derived_object decorator.

The class inherits from CObject (apache-tvm-ffi), a C extension type with
__slots__ = () and no instance __dict__. Without explicit __slots__,
setting self._inst in __init__ raises AttributeError, and weakref.ref(self)
fails because __weakref__ is not available.

Root cause: tilelang migrated from a custom TVM fork to apache-tvm-ffi
(October 2025). The old fork's Object type allowed arbitrary instance
attributes; the new CObject does not.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix TVMDerivedObject slots in meta_schedule/utils.py duplicate

Apply the same __slots__ fix to the second copy of derived_object in
meta_schedule/utils.py. Most @derived_object users (LocalRunner,
LocalBuilder, cost models, etc.) import from this copy, not
runtime/support.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 10:09

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

apstenku123 added a commit to DatasunriseOU/cppmega_mlx that referenced this pull request May 4, 2026
…, #37/#38/#39)

Three parallel agents completed the supermodule/submodule split filing:

1. tilelang_metal_fp8 (storage-only FP8 emulation) split:
   - 0001-tilelang-metal-fp8-storage-only.patch — supermodule half (235 lines)
   - 0002-tvm-metal-fp8-storage-only.patch — TVM-mirror half (260 lines, prefix stripped)
   - PR tile-ai/tilelang#2144 (supermodule, stacks on PR #2130)
   - PR tile-ai/tvm#38 (TVM mirror, base tilelang_main @ 0e15b274)

2. tilelang_metal_fp8_vector (vector cast lanes 2/3/4) split:
   - 0001-tilelang-metal-fp8-vector-cast.patch — supermodule half (148 lines)
   - 0002-tvm-metal-fp8-vector-cast.patch — TVM-mirror half (151 lines)
   - PR tile-ai/tilelang#2145 (supermodule, depends on #2144)
   - PR tile-ai/tvm#39 (TVM mirror, depends on #38)

3. PR #2143 TVM-mirror companion:
   - PR tile-ai/tvm#37 — already filed, README updated to link both halves

Total filed today: 11 PRs across 3 repos
- 1 ml-explore/mlx (#3476)
- 1 apache/tvm (#19504)
- 6 tile-ai/tilelang (#2139, #2140, #2141, #2142, #2143 super, #2144 super, #2145 super)
- 3 tile-ai/tvm (#37, #38, #39 — TVM-mirror companions)

PR #2142 (T.fp8_scaled_matmul) has no TVM-mirror companion needed —
verified the patch only touches supermodule files.

All splits round-trip clean (apply forward + reverse) on their respective
bases. README files in each docs/upstream/<dir>/ updated with PR URLs and
dependency-chain diagrams.

Note: TileLang/tvm redirects to tile-ai/tvm server-side (canonical org
slug). All TVM-mirror PRs land at tile-ai/tvm/pull/N URLs.