[AMD][Gfx950] Add the support of 160K LDS and copy.async #2058
📝 Walkthrough
Added a gfx950-only 128-bit async buffer→LDS load helper and switched
Changes
Sequence Diagram(s)
sequenceDiagram
participant Kernel
participant AsyncASM as Async ASM
participant GlobalMem as Global Memory
participant LDS
Kernel->>AsyncASM: call async_buffer_load_dwordx4_v(smem, rsrc, voffset)
AsyncASM->>GlobalMem: buffer_load_dwordx4 ... offen (m0 set, ptr from smem)
GlobalMem-->>AsyncASM: 128-bit data
AsyncASM->>LDS: write 128-bit data directly to LDS (smem pointer)
Kernel->>LDS: subsequent reads use loaded data
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/tl_templates/hip/copy.h (1)
79-89: Consider matching async_buffer_load_dword_v's pre_nop template for consistency.
async_buffer_load_dword_v is defined with template <bool pre_nop = false> (Line 64). The new async_buffer_load_dwordx4_v omits this. If not needed, fine; otherwise adding the same template shell keeps both helpers symmetric for future s_nop insertion needs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tl_templates/hip/copy.h` around lines 79 - 89, The helper async_buffer_load_dwordx4_v is missing the template<bool pre_nop = false> used by async_buffer_load_dword_v; update async_buffer_load_dwordx4_v to the same template signature and, where async_buffer_load_dword_v conditionally emits an s_nop when pre_nop is true, add the same conditional s_nop insertion before the asm in async_buffer_load_dwordx4_v so both helpers are symmetric (refer to async_buffer_load_dword_v, async_buffer_load_dwordx4_v, pre_nop and s_nop to locate the changes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tilelang/carver/arch/cdna.py`:
- Line 26: Fix the typo in the comment that reads "TVM runtime should orrectly
report 160 KB (163840 B) for gfx950; the" by changing "orrectly" to "correctly"
so the comment reads "TVM runtime should correctly report 160 KB (163840 B) for
gfx950; the". Reference the exact comment string containing "orrectly" to locate
the change in cdna.py.
📒 Files selected for processing (2)
src/tl_templates/hip/copy.h
tilelang/carver/arch/cdna.py
Actionable comments posted: 3
📒 Files selected for processing (1)
testing/python/amd/test_tilelang_gfx950_copy_async.py
def _is_gfx950() -> bool:
    try:
        from tilelang import tvm

        mcpu = str(tvm.target.Target("rocm").attrs.get("mcpu", ""))
        return "gfx950" in mcpu
    except Exception:
        return False
🧩 Analysis chain
🏁 Script executed:
# First, let's read the test file to understand the full context
cat -n testing/python/amd/test_tilelang_gfx950_copy_async.py
Repository: tile-ai/tilelang
Length of output: 8695

🏁 Script executed:
# Check if there are other device detection helpers in the codebase
fd -t f -i "device\|detect\|target" testing/python | head -20
rg -i "is_gfx\|device_detection\|get.*mcpu" testing/python -A 2 -B 2
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Look at the CDNA implementation mentioned in the comment
cat -n tilelang/carver/arch/cdna.py | head -60
Repository: tile-ai/tilelang
Length of output: 2179

🏁 Script executed:
# Search for how Target is used elsewhere in tests
rg "Target\(.*rocm" testing/ -A 2 -B 2 | head -40
Repository: tile-ai/tilelang
Length of output: 888

🏁 Script executed:
# Search for Target usage with explicit -mcpu in tests
rg "Target\(" testing/ -A 2 | grep -i "mcpu\|rocm" -B 2 -A 2
Repository: tile-ai/tilelang
Length of output: 621

🏁 Script executed:
# Look for TVM documentation or examples of device detection
find . -name "*.py" -type f -exec grep -l "target.*mcpu\|mcpu.*target" {} \; | head -10
Repository: tile-ai/tilelang
Length of output: 173

🏁 Script executed:
# Check if there are any device query mechanisms that explicitly report gfx950
rg "gfx950\|MI350" . -t py --iglob "*.py" | grep -v test | head -20
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Look at how the requires_rocm decorator works
rg "requires_rocm" . -B 5 -A 10 | head -60
Repository: tile-ai/tilelang
Length of output: 4109

🏁 Script executed:
# Let's check the target utilities file
cat -n tilelang/utils/target.py
Repository: tile-ai/tilelang
Length of output: 10181

🏁 Script executed:
# Search for how TVM Target auto-detects attributes
rg "rocm\|mcpu" tilelang/utils/ -B 2 -A 2
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check if there are any explicit examples with -mcpu in the codebase
rg "\-mcpu=\|mcpu.*=" . -t py | head -20
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check if target_is_gfx950 is used elsewhere in tests
rg "target_is_gfx950" testing/ -B 2 -A 2
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check what the FFI API returns
rg "TargetIsGfx950" . -B 2 -A 2
Repository: tile-ai/tilelang
Length of output: 1169

🏁 Script executed:
# Look at the imports in the test file more carefully
head -20 testing/python/amd/test_tilelang_gfx950_copy_async.py
Repository: tile-ai/tilelang
Length of output: 796

🏁 Script executed:
# Check if there are any other examples of hardware detection in AMD tests
ls -la testing/python/amd/
Repository: tile-ai/tilelang
Length of output: 515

🏁 Script executed:
# Search for explicit target string construction in other AMD tests
rg "Target\(" testing/python/amd/ -B 1 -A 1
Repository: tile-ai/tilelang
Length of output: 551

🏁 Script executed:
# Check the C++ implementation to understand TargetIsGfx950 behavior
cat -n src/target/utils.cc | grep -A 10 "TargetIsGfx950"
Repository: tile-ai/tilelang
Length of output: 1153

🏁 Script executed:
# Verify if target_is_gfx950 is exported from tilelang
rg "target_is_gfx950|from.*target import" tilelang/__init__.py -B 2 -A 2
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check if the test file ever uses the built-in target_is_gfx950 function
rg "target_is_gfx950\|from tilelang.utils.target" testing/python/amd/test_tilelang_gfx950_copy_async.py
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check what happens when Target is created without explicit options - look at TVM docs or comments
rg "Target.*rocm\|Target.*hip" tilelang/ -B 1 -A 3 | head -40
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Search for how other tests use gfx950 detection or target_is_gfx950
rg "gfx950\|target_is_gfx950\|TargetIsGfx950" testing/ -B 2 -A 2
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check if there are any explicit mcpu examples in other files
rg "\-mcpu\|mcpu=" . --type py | head -15
Repository: tile-ai/tilelang
Length of output: 42

🏁 Script executed:
# Check what attributes Target("rocm") actually gets
python3 << 'EOF'
try:
    # Simple test to understand TVM behavior
    from tvm.target import Target
    t = Target("rocm")
    print("Target attrs keys:", list(t.attrs.keys()))
    print("mcpu value:", t.attrs.get("mcpu", "NOT_PRESENT"))
    print("Full attrs:", dict(t.attrs))
except Exception as e:
    print(f"Error: {e}")
    print("TVM may not be available in sandbox")
EOF
Repository: tile-ai/tilelang
Length of output: 124

🏁 Script executed:
# Look at the tilelang/__init__.py to see what's exported
grep -n "target_is_gfx950\|from.*utils.target" tilelang/__init__.py
Repository: tile-ai/tilelang
Length of output: 42
Use target_is_gfx950() helper instead of manual mcpu detection.
Lines 27 and 123 construct Target("rocm") without explicit -mcpu=gfx950. When mcpu is not present in target attributes, _is_gfx950() returns False and test_gfx950_smem_cap_160kb takes the non-gfx950 branch, bypassing validation of the 160 KB override logic. Use the existing target_is_gfx950() helper from tilelang.utils.target instead, or explicitly construct targets with -mcpu=gfx950. Also remove the broad except Exception to avoid silently hiding target-detection failures.
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 29-29: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@testing/python/amd/test_tilelang_gfx950_copy_async.py` around lines 23 - 30,
Replace the manual mcpu detection in _is_gfx950() with the existing helper
target_is_gfx950() from tilelang.utils.target (import and call
target_is_gfx950()) or, if you prefer explicit targets, construct
tvm.target.Target("rocm", options="-mcpu=gfx950") when detecting/creating
targets used by test_gfx950_smem_cap_160kb; also remove the broad "except
Exception" so errors surface instead of being swallowed. Use the symbol
target_is_gfx950 (or explicit Target(...) with options) and update any calls
that currently use _is_gfx950 or Target("rocm") in this file.
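A minimal sketch of the stricter detection this comment asks for, written against a plain attribute mapping so it runs without TVM. The name `mcpu_is_gfx950` and the dict-shaped `attrs` argument are hypothetical stand-ins for illustration; this is not the `target_is_gfx950()` helper in `tilelang.utils.target`. The only point it makes is that a missing `mcpu` can be reported as "unknown" instead of being folded into `False` by a broad `except Exception`.

```python
from typing import Mapping, Optional


def mcpu_is_gfx950(attrs: Mapping[str, object]) -> Optional[bool]:
    """Hypothetical helper: classify target attrs by their mcpu entry.

    Returns True/False when mcpu is present and None when it is absent,
    so the caller can tell "not gfx950" apart from "could not detect".
    """
    mcpu = attrs.get("mcpu")
    if mcpu is None:
        return None  # detection impossible; do not silently claim "not gfx950"
    return "gfx950" in str(mcpu)


# A target built with an explicit -mcpu=gfx950 would carry the attribute ...
print(mcpu_is_gfx950({"mcpu": "gfx950"}))
# ... while a bare "rocm" target may not, which is the case the review flags.
print(mcpu_is_gfx950({}))
```

A test could then skip or fail loudly on `None` rather than quietly taking the non-gfx950 branch.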
@tilelang.testing.requires_rocm
def test_gfx950_cp_async_gs_16_in_codegen():
    """coalesced_width=8 (16 bytes) must emit cp_async_gs<16> in generated HIP source."""
    prog = _matmul_kernel(
        256,
        256,
        256,
        128,
        128,
        32,
        False,
        True,
        T.float16,
        T.float32,
        T.float32,
        num_stages=2,
        coalesced_width=8,  # 8 fp16 = 16 bytes → cp_async_gs<16>
    )
    kernel = tl.compile(prog, out_idx=[2])
    src = kernel.get_kernel_source()
    assert "cp_async_gs<16>" in src, "Expected cp_async_gs<16> in generated HIP source for 128-bit async copy path"
Assert the gfx950-specific async load, not just the wrapper call.
cp_async_gs<16> only proves the generic 16-byte wrapper was emitted. It does not prove the gfx950 path lowers to buffer_load_dwordx4 ... lds, which is the behavior this PR adds. On gfx950, add an assertion against the generated/emitted source that contains the direct-to-LDS instruction, or add a lower-level codegen check for the helper implementation.
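One way the stronger assertion could look, as a sketch that operates on a source string only. The helper name `uses_gfx950_async_path` and the regular expression are assumptions; in particular, it assumes the `buffer_load_dwordx4 ... lds` text is visible in whatever source the test inspects, which may not hold if the instruction lives only in the template header.

```python
import re

# Assumed spelling of the direct-to-LDS load in the inspected text; the
# operands between the mnemonic and the "lds" modifier are not constrained.
_DIRECT_TO_LDS = re.compile(r"buffer_load_dwordx4[^\n]*\blds\b")


def uses_gfx950_async_path(src: str) -> bool:
    """True only if both the 16-byte wrapper and the direct-to-LDS load appear."""
    has_wrapper = "cp_async_gs<16>" in src
    has_direct_load = _DIRECT_TO_LDS.search(src) is not None
    return has_wrapper and has_direct_load


# Wrapper alone is not enough; the instruction pattern must also be present.
wrapper_only = "tl::cp_async_gs<16>(dst, src);"
with_direct_load = wrapper_only + "\nbuffer_load_dwordx4 %1, %2, 0 offen lds"
print(uses_gfx950_async_path(wrapper_only))
print(uses_gfx950_async_path(with_direct_load))
```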
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@testing/python/amd/test_tilelang_gfx950_copy_async.py` around lines 89 - 109,
The test test_gfx950_cp_async_gs_16_in_codegen currently only asserts the
generic wrapper "cp_async_gs<16>" is present; update it to assert the
gfx950-specific lowering is emitted by checking kernel.get_kernel_source() for
the direct-to-LDS instruction(s) introduced by the PR (e.g., the
"buffer_load_dwordx4" pattern and usage of "lds" in the emitted HIP) or, if the
PR adds a helper symbol, assert that helper symbol name appears; specifically,
replace or add to the existing assert that searches for "cp_async_gs<16>" with
an assertion that the source contains the gfx950 instruction sequence (e.g.,
"buffer_load_dwordx4" and "lds") so the test verifies the gfx950 path is used.
@pytest.mark.parametrize(
    "trans_A, trans_B, k_pack",
    [
        (False, False, 1),
        (False, True, 1),
        (True, True, 1),
        (True, False, 1),
    ],
)
@tilelang.testing.requires_rocm
def test_gfx950_copy_async_gemm_pipelined(trans_A, trans_B, k_pack):
    """Pipelined GEMM (num_stages=2) with gfx950 copy.async must be numerically correct."""
    prog = _matmul_kernel(
        512,
        512,
        512,
        128,
        128,
        32,
        trans_A,
        trans_B,
        T.float16,
        T.float32,
        T.float32,
        num_stages=2,
        threads=128,
        k_pack=k_pack,
        coalesced_width=4 * k_pack,
    )
Exercise the 16-byte path in the correctness test.
All k_pack values are 1, so Line 168 always sets coalesced_width=4 and never reaches the cp_async_gs<16> path. Add at least one k_pack=2 / coalesced_width=8 correctness case so data corruption in the new 128-bit gfx950 copy path is caught.
Proposed test coverage adjustment
     [
         (False, False, 1),
         (False, True, 1),
         (True, True, 1),
         (True, False, 1),
+        # Exercise coalesced_width=8 -> cp_async_gs<16>.
+        (False, False, 2),
     ],
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@testing/python/amd/test_tilelang_gfx950_copy_async.py` around lines 141 -
169, The parametrization for test_gfx950_copy_async_gemm_pipelined only uses
k_pack=1 so coalesced_width is always 4 and the 16-byte gfx950 path
(cp_async_gs<16>) is never exercised; update the param list in the test (the
pytest.mark.parametrize tuple for "trans_A, trans_B, k_pack") to include at
least one case with k_pack=2 (which yields coalesced_width=8) so the call to
_matmul_kernel(...) will exercise the 128-bit/cp_async_gs<16> path and catch any
data corruption there.
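The arithmetic behind this comment can be checked with a short sketch. The function name `async_copy_bytes` is hypothetical; the relation `coalesced_width = 4 * k_pack` and the 2-byte fp16 element size come from the test code quoted above.

```python
def async_copy_bytes(k_pack: int, elem_bytes: int = 2) -> int:
    """Bytes moved per async copy when coalesced_width = 4 * k_pack elements."""
    coalesced_width = 4 * k_pack
    return coalesced_width * elem_bytes


# k_pack=1 stays on the 8-byte path; k_pack=2 reaches the 16-byte path.
for k_pack in (1, 2):
    print(k_pack, async_copy_bytes(k_pack))
```

With only `k_pack=1` in the parametrization, every case moves 8 bytes per copy, so the `cp_async_gs<16>` path is never reached.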
1607dff to acf714e
♻️ Duplicate comments (3)
testing/python/amd/test_tilelang_gfx950_copy_async.py (3)
141-169: ⚠️ Potential issue | 🟠 Major
Add a correctness case for the 16-byte async-copy path.
All current cases use k_pack=1, so Line 168 keeps coalesced_width=4 and never exercises cp_async_gs<16>. Add at least one k_pack=2 case.
Proposed coverage addition
     [
         (False, False, 1),
         (False, True, 1),
         (True, True, 1),
         (True, False, 1),
+        # coalesced_width=8 -> cp_async_gs<16>.
+        (False, False, 2),
     ],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/amd/test_tilelang_gfx950_copy_async.py` around lines 141 - 169, The test only uses k_pack=1 so coalesced_width=4 never exercises the 16-byte async-copy path; update the parameterization for test_gfx950_copy_async_gemm_pipelined to include at least one case with k_pack=2 (for example add a tuple (False, False, 2) or add 2 to the k_pack choices) so that _matmul_kernel(...) is invoked with coalesced_width=8 and triggers the cp_async_gs<16> code path.
23-30: ⚠️ Potential issue | 🟠 Major
Use a real gfx950 target detector for the smem-cap assertion.
Target("rocm") can lack mcpu, so _is_gfx950() returns False and the gfx950 branch is bypassed on the hardware this test is meant to validate. Prefer the existing target utility or pass an explicit -mcpu=gfx950 target; also avoid swallowing detection failures with broad except Exception.
Run this read-only check to confirm the helper signature and remaining manual mcpu probes:
#!/bin/bash
set -euo pipefail
# Expect: target_is_gfx950 helper definition/callable wrapper is visible.
rg -n -C4 '\btarget_is_gfx950\b|TargetIsGfx950' tilelang src testing
# Expect: this test no longer relies on Target("rocm").attrs["mcpu"] for gfx950 detection.
rg -n -C4 'Target\("rocm"\)|attrs\.get\("mcpu"' testing/python/amd/test_tilelang_gfx950_copy_async.py tilelang/carver/arch/cdna.py
Also applies to: 123-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/amd/test_tilelang_gfx950_copy_async.py` around lines 23 - 30, The _is_gfx950() helper uses Target("rocm").attrs.get("mcpu", "") and swallows all errors with a broad except, causing false negatives; replace its logic to call the existing target utility (e.g. target_is_gfx950) or detect via an explicit target string/mcpu flag (e.g. parse "-mcpu=gfx950"), and remove the broad except by either letting errors propagate or catching only specific attribute/key errors (AttributeError/KeyError) so genuine failures are not hidden; update references inside _is_gfx950 to avoid using Target("rocm").attrs.get("mcpu") directly and ensure the new implementation reliably returns True for gfx950 hardware.
107-109: ⚠️ Potential issue | 🟠 Major
Assert the gfx950 direct-to-LDS load, not only the wrapper.
cp_async_gs<16> is emitted from the copy size alone, so this can pass even if the gfx950 buffer_load_dwordx4 ... lds path is broken or removed. Add an assertion for the emitted helper/instruction pattern, e.g. buffer_load_dwordx4 plus lds, when compiling for gfx950.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/amd/test_tilelang_gfx950_copy_async.py` around lines 107 - 109, The test currently only asserts the wrapper "cp_async_gs<16>" was emitted but not the actual gfx950 direct-to-LDS load; update the test around kernel = tl.compile(...) / src = kernel.get_kernel_source() to also assert the gfx950-specific helper/instruction pattern when compiling for gfx950 by checking src contains the direct-load tokens (e.g. "buffer_load_dwordx4" and "lds" or the combined "buffer_load_dwordx4 ... lds") in addition to "cp_async_gs<16>", so the test fails if the buffer_load_dwordx4 to LDS path is missing or removed. Ensure the check is gated to the gfx950 compilation case.
📒 Files selected for processing (1)
testing/python/amd/test_tilelang_gfx950_copy_async.py
)
* Fix HIP codegen for sync_warp, sync_grid, and local.var initialisation
* [AMD/HIP] Fix warp_reduce VGPR bug, ShuffleNode packing, and Pipelined LDS overflow

Extends PR #2096 with three additional fixes for CDNA (MI350) targets:

Fix 1 (src/tl_templates/hip/reduce.h): warp_reduce width=32
The old 6-step butterfly called __shfl_xor(value, 32) without a width argument. On CDNA (wave64) with 32 active threads, lanes 32-63 are inactive and hold uninitialised VGPRs, producing NaN in reduce_max / reduce_sum / AllReduce. Fix: remove the step-32 shuffle; pass width=32 to all remaining 5 steps so every shuffle stays within the [0,31] group.

Fix 2 (src/target/codegen_hip.cc + src/tl_templates/hip/common.h): ShuffleNode bfloat16x2 / float16x2 packing
CodeGenC emitted `uint1(a, b)` for bfloat16x2 construction, which is an invalid HIP constructor call. Fix: override VisitExpr_(ShuffleNode) in CodeGenTileLangHIP to emit `uint1{__pack_bfloat162(a, b)}` / `uint1{__pack_half2(a, b)}` using aggregate initialisation. Also add five bfloat16x2 math overloads for uint1 carrier (abs2/max2/min2/add2/mul2).

Fix 3 (src/transform/pipeline_planning.cc): skip T.Pipelined(num_stages>1)
Double-buffering doubled LDS per loop-body buffer. On CDNA (≤128 KB LDS per workgroup), this caused hipModuleLaunchKernel EINVAL. Fix: when TargetIsRocm() && num_stages > 1, skip pipeline planning and fall back to a plain sequential loop with synchronous T.copy.

Also: fix __habs(hip_bfloat16) and __habs(float16_t) in common.h to use __builtin_memcpy instead of reinterpret_cast to avoid strict-aliasing UB (as flagged by CodeRabbit on PR #2096).

Tests: 19 new cases added to testing/python/amd/test_tilelang_hip_codegen.py covering all three fixes. All 42 tests pass on MI350 (gfx950).

* [AMD/HIP] Merge test_tilelang_hip_bugfixes.py into test_tilelang_hip_codegen.py

Consolidate all HIP regression tests into a single file. The merged file covers all six fixes with 32 tests total (previously split across two files with duplicated test cases for warp_reduce, pipelined GEMM, and ShuffleNode).

Changes versus the two individual files:
- Deduplicated test_warp_reduce_no_nan (identical in both files)
- Deduplicated test_pipelined_no_lds_overflow / test_pipelined_shared_mem_no_launch_error
- Deduplicated test_pipelined_multi_stage_fp16_gemm
- Merged bfloat16 shuffle tests: source check + runtime correctness in one function
- Kept PR #2096 source-inspection tests (alloc_var, sync_warp, sync_grid)
- Added runtime tests from bugfixes: inf init, serial loop accumulation, float scalar readback, two-group wave64 reduce, float16 shuffle

* fixup: correct LDS size comment: gfx950 has 160 KB, not 128 KB

gfx942 (CDNA3 / MI300X) has 64 KB LDS per workgroup. gfx950 (CDNA4 / MI350) has 160 KB LDS per workgroup (see PR #2058). The old comment said '128 KB' which is wrong for both generations. Updated pipeline_planning.cc and the test docstrings to reflect the correct per-architecture limits.

* update for format checking
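The warp_reduce fix described in this commit message can be illustrated with a small host-side model. This is a sketch, not the code in reduce.h: `shfl_xor` and `butterfly_sum` are hypothetical names, NaN stands in for the uninitialised lanes 32-63, and the rule that an out-of-group source yields the caller's own value is an assumption modelled on the width parameter of `__shfl_xor`.

```python
import math


def shfl_xor(vals, mask, width):
    """Model of a width-limited XOR shuffle over a list of lane values."""
    out = []
    for lane in range(len(vals)):
        src = lane ^ mask
        # Assumption: a source outside the caller's width-aligned group
        # yields the caller's own value.
        if src // width != lane // width:
            src = lane
        out.append(vals[src])
    return out


def butterfly_sum(vals, masks, width):
    """Butterfly reduction: each step adds the XOR-partner's value."""
    for mask in masks:
        partner = shfl_xor(vals, mask, width)
        vals = [a + b for a, b in zip(vals, partner)]
    return vals


# 32 active lanes hold 1.0; NaN stands in for the uninitialised lanes 32-63.
lanes = [1.0] * 32 + [math.nan] * 32

# Old scheme: 6 steps across the full 64-lane wave, starting with mask 32.
old = butterfly_sum(lanes, [32, 16, 8, 4, 2, 1], width=64)
# Fixed scheme: 5 steps, every shuffle confined to a 32-lane group.
new = butterfly_sum(lanes, [16, 8, 4, 2, 1], width=32)

print(old[0])  # lane 0 has absorbed lane 32's value
print(new[0])  # lane 0 sums lanes 0-31 only
```

In this model the step with mask 32 is the only one that crosses from the active half into lanes 32-63, which is why dropping it and confining the remaining steps to width 32 keeps the active lanes clean.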
Hi TileLang team,
This PR adds hardware-specific optimizations for the AMD gfx950 (CDNA4 / MI350) GPU architecture, targeting two key improvements: support for the 160 KB LDS and for copy.async (direct global-to-LDS async copy).
Changes:
src/tl_templates/hip/copy.h:
1) Added async_buffer_load_dwordx4_v(), a gfx950-specific inline assembly function that issues buffer_load_dwordx4 ... lds for direct global-to-LDS async transfer.
2) Updated cp_async_gs<16> and cp_async_gs_conditional<16> with #if defined(__gfx950__) guards to dispatch to the new 128-bit path on gfx950, falling back to the existing uint4 pointer-copy path on all other targets.
tilelang/carver/arch/cdna.py:
1) Added the _GFX950_LDS_SIZE = 160 * 1024 constant for gfx950's expanded LDS capacity.
2) Updated CDNA.__init__() to detect gfx950 from the target's mcpu attribute and override smem_cap when the driver-reported value is below 160 KB.
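The smem_cap override described for cdna.py amounts to a small rule, sketched here as a standalone function. The name `effective_smem_cap` and its string/int arguments are hypothetical; the actual change lives in the CDNA class constructor and reads the value from the TVM target, which this sketch does not attempt to reproduce.

```python
GFX950_LDS_SIZE = 160 * 1024  # 163840 bytes, mirroring the PR's _GFX950_LDS_SIZE


def effective_smem_cap(mcpu: str, reported_bytes: int) -> int:
    """Hypothetical helper: raise smem_cap to 160 KB on gfx950 if reported lower."""
    if "gfx950" in mcpu and reported_bytes < GFX950_LDS_SIZE:
        return GFX950_LDS_SIZE
    return reported_bytes


# A gfx950 target reporting 64 KB is raised; other targets are left alone.
print(effective_smem_cap("gfx950", 64 * 1024))
print(effective_smem_cap("gfx942", 64 * 1024))
```

Only raising the value when it is below 160 KB means a runtime that already reports the correct size passes through unchanged.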
Impact
Thanks