Skip to content

[Bugfix] Raise error on zero grid dimension instead of silent clamp#1994

Merged
LeiWang1999 merged 2 commits into
mainfrom
fix/zero-grid-dim-check-issue-1993
Mar 30, 2026
Merged

[Bugfix] Raise error on zero grid dimension instead of silent clamp#1994
LeiWang1999 merged 2 commits into
mainfrom
fix/zero-grid-dim-check-issue-1993

Conversation

@LeiWang1999
Copy link
Copy Markdown
Member

@LeiWang1999 LeiWang1999 commented Mar 30, 2026

Fixes #1993.

TVM's ThreadWorkLoad::Extract() pre-fills all grid/block dims with 1, and only overwrites them when the actual value is > 0. So if a grid dimension is 0 — either from a dynamic shape like num_tokens being 0 at runtime, or from a literal T.Kernel(0) — the kernel still launches with grid=(1,1,1). No error, no warning. This causes either a CUDA_ERROR_ILLEGAL_ADDRESS crash (the zero-size tensor has data_ptr()==NULL, so the kernel writes through a null pointer) or silent wrong results depending on constant folding.

The fix is straightforward:

  • In thread_storage_scope.h, stop clamping: just write the actual size unconditionally, so a 0 stays 0.
  • In cuda_module.cc, add an ICHECK before any cuLaunchKernel* call that checks all three grid dims are positive, with a message that tells you exactly which dim is zero and which kernel it is.

For example, instead of getting the cryptic async CUDA_ERROR_ILLEGAL_ADDRESS at synchronize(), you now get:

CUDALaunch Error: grid dimension must be positive, but got grid=(0,1,1) in kernel my_kernel.
A zero grid dimension is often caused by a dynamic shape (e.g. num_tokens) being 0 at runtime.

Regression test included for both the dynamic and static cases.

Summary by CodeRabbit

  • Tests

    • Added CUDA-only regression tests validating behavior when kernel grid dimensions are zero, ensuring correct success/failure and no unintended writes.
  • Chores

    • Updated embedded TVM submodule reference to a newer commit.

…1993)

Fix ThreadWorkLoad::Extract() silently clamping zero grid dims to 1,
which caused either CUDA_ERROR_ILLEGAL_ADDRESS crashes (dynamic case)
or silent wrong results (static case).

Closes #1993
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f599ac0e-a952-4cea-9554-d817e9ce4068

📥 Commits

Reviewing files that changed from the base of the PR and between da66668 and 4960186.

📒 Files selected for processing (1)
  • testing/python/issue/test_tilelang_issue_1993.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • testing/python/issue/test_tilelang_issue_1993.py

📝 Walkthrough

Walkthrough

Updated the TVM submodule reference and added a CUDA-only regression test validating TileLang behavior when T.Kernel grid dimensions resolve to zero (both dynamic and static cases), exercising runtime launch/guard behavior.

Changes

Cohort / File(s) Summary
TVM Submodule Update
3rdparty/tvm
Bumped embedded TVM commit reference to a newer commit that changes handling of zero grid dimensions in TVM runtime (affects ThreadWorkLoad extraction / grid dim behavior). Review runtime launch behavior in updated TVM.
Regression Tests for Issue #1993
testing/python/issue/test_tilelang_issue_1993.py
Added CUDA-only tests: dynamic-grid-case (uses T.dynamic("num_tokens")) and static-grid-case (T.Kernel(0)) to assert correct error/guard behavior when effective grid dimension is zero. Uses CUDA sync to surface async errors.
sequenceDiagram
    participant Test as "Test (pytest)"
    participant TileLang as "TileLang JIT kernel"
    participant TVM as "TVM runtime"
    participant CUDA as "CUDA driver"

    Test->>TileLang: invoke kernel(out, num_tokens)
    TileLang->>TVM: dispatch kernel launch (grid dims)
    alt any grid dim == 0
        TVM->>TVM: detect zero grid dim (guard)
        TVM-->>TileLang: raise error OR skip launch
        TileLang-->>Test: raise exception / no-op return
    else all dims > 0
        TVM->>CUDA: cuLaunchKernelEx(...)
        CUDA-->>TVM: launch success/failure
        TVM-->>TileLang: return status
        TileLang-->>Test: return / possible async error on sync
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 In kernels where zeros once slyly crept,
I hopped through code and bravely leapt.
TVM now checks before the call,
Tests stand guard to catch the fall.
Hooray — no more surprises in the CUDA hall! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main fix: raising an error on zero grid dimension instead of silently clamping, which matches the primary changes in the PR.
Linked Issues check ✅ Passed The PR addresses all coding requirements from issue #1993: preventing silent clamping of zero grid dimensions [#1993], raising clear errors [#1993], and adding regression tests for both dynamic and static cases [#1993].
Out of Scope Changes check ✅ Passed All changes are directly related to fixing issue #1993: the TVM submodule update likely includes the thread_storage_scope.h and cuda_module.cc fixes, and the new test file validates the exact scenarios described in the issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/zero-grid-dim-check-issue-1993

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/issue/test_tilelang_issue_1993.py (1)

67-72: Same exception type refinement applies here.

Same recommendation as above—use a specific exception type. The buffer assertion on line 72 is good practice to verify no spurious writes occurred.

♻️ Suggested improvement
-    with pytest.raises(Exception):
+    with pytest.raises(RuntimeError, match=r"grid dimension must be positive"):
         kernel(out)
         torch.cuda.synchronize()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/issue/test_tilelang_issue_1993.py` around lines 67 - 72,
Replace the broad pytest.raises(Exception) with the specific exception class
actually raised by kernel(out) or torch.cuda.synchronize() (e.g., RuntimeError
or the exact CUDA error type your code emits); update the context manager from
pytest.raises(Exception) to pytest.raises(<ExactExceptionClass>) while keeping
the kernel(out) and torch.cuda.synchronize() calls intact and preserving the
final buffer check assert out.eq(0.0).all().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@testing/python/issue/test_tilelang_issue_1993.py`:
- Around line 49-53: Replace the broad Exception catches with the specific
tvm.TVMError in both test functions test_issue_1993_dynamic_zero_grid_dim and
test_issue_1993_static_zero_grid_dim: change the with pytest.raises(Exception):
blocks that wrap kernel(out_empty) and torch.cuda.synchronize() to with
pytest.raises(tvm.TVMError): and add the import statement from tilelang import
tvm near the top of the test file if it’s missing so the tvm symbol is
available.

---

Nitpick comments:
In `@testing/python/issue/test_tilelang_issue_1993.py`:
- Around line 67-72: Replace the broad pytest.raises(Exception) with the
specific exception class actually raised by kernel(out) or
torch.cuda.synchronize() (e.g., RuntimeError or the exact CUDA error type your
code emits); update the context manager from pytest.raises(Exception) to
pytest.raises(<ExactExceptionClass>) while keeping the kernel(out) and
torch.cuda.synchronize() calls intact and preserving the final buffer check
assert out.eq(0.0).all().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 323df7b9-c661-48bf-a105-9e109ce72a4c

📥 Commits

Reviewing files that changed from the base of the PR and between 3a956e0 and da66668.

📒 Files selected for processing (2)
  • 3rdparty/tvm
  • testing/python/issue/test_tilelang_issue_1993.py

Comment on lines +49 to +53
# Zero case: should raise an error, not crash with illegal memory access
out_empty = torch.zeros(0, dtype=torch.float32, device="cuda")
with pytest.raises(Exception):
kernel(out_empty)
torch.cuda.synchronize()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Search for how ICHECK failures surface as exceptions in TVM/TileLang
# to determine the correct exception type to catch

# Look for existing tests that catch TVM runtime errors
rg -n "pytest\.raises.*RuntimeError|pytest\.raises.*TVMError" --type=py testing/

Repository: tile-ai/tilelang

Length of output: 499


🏁 Script executed:

# Read the test file around the flagged lines to understand context
fd test_tilelang_issue_1993.py --type=f -x cat -n {}

Repository: tile-ai/tilelang

Length of output: 2609


🏁 Script executed:

# Search for ICHECK statements in the codebase to understand what exception they raise
rg -n "ICHECK" --type=py --type=cpp --type=cc -B 2 -A 2 | head -100

Repository: tile-ai/tilelang

Length of output: 86


🏁 Script executed:

# Search for grid dimension validation or error messages
rg -i "grid.*dimension|dimension.*grid" --type=py --type=cpp -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 4717


🏁 Script executed:

# Look for kernel grid or block dimension validation
rg -n "T\.Kernel" --type=py -B 2 -A 2 | head -80

Repository: tile-ai/tilelang

Length of output: 6797


🏁 Script executed:

# Check what tvm.TVMError is and if it's imported in test files
rg -n "import.*tvm|from.*tvm" testing/python/ --type=py | grep -E "TVMError|tvm" | head -20

Repository: tile-ai/tilelang

Length of output: 2176


🏁 Script executed:

# Search for grid dimension validation/checking code in kernel-related files
rg -n "grid|block|kernel" tilelang/language/kernel.py --type=py | head -60

Repository: tile-ai/tilelang

Length of output: 3567


🏁 Script executed:

# Look for ICHECK patterns that might validate grid dimensions
rg -n "ICHECK.*grid|grid.*0|dimension.*0" --type=py --type=cpp | head -40

Repository: tile-ai/tilelang

Length of output: 4297


🏁 Script executed:

# Check if tvm is imported or accessible in the tilelang package
rg -n "from.*tvm|import.*tvm" tilelang/ --type=py | head -20

Repository: tile-ai/tilelang

Length of output: 1422


🏁 Script executed:

# Search for where grid dimension 0 is validated or checked
rg -n "grid.*>.*0|grid.*== 0|grid.*!= 0|positive.*grid" --type=py --type=cpp | head -30

Repository: tile-ai/tilelang

Length of output: 801


🏁 Script executed:

# Look for ICHECK statements related to grid that might validate dimension > 0
rg -n "ICHECK.*grid" --type=cpp -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 1946


🏁 Script executed:

# Search in the wrapper/adapter code where kernel launches are prepared
rg -n "gridDim|grid_info" tilelang/jit/adapter/wrapper.py --type=py -B 3 -A 3 | head -80

Repository: tile-ai/tilelang

Length of output: 3694


🏁 Script executed:

# Search for how grid dimension validation happens in kernel launch
rg -n "dynamic.*grid|grid.*dynamic" tilelang/jit/ --type=py -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Look at the exact test file to see what's imported and what's available
cat -n testing/python/issue/test_tilelang_issue_1993.py

Repository: tile-ai/tilelang

Length of output: 2609


🏁 Script executed:

# Search for where the kernel is actually compiled/launched to see where the validation happens
rg -n "def kernel|LaunchThread|KernelLaunch" tilelang/language/kernel.py --type=py -B 2 -A 5 | head -60

Repository: tile-ai/tilelang

Length of output: 2081


🏁 Script executed:

# Search for where runtime validation happens for grid dimensions
rg -n "must be positive|positive.*grid|grid.*positive|must be.*non-zero" --type=py --type=cpp

Repository: tile-ai/tilelang

Length of output: 2770


🏁 Script executed:

# Look for exception raising in the wrapper code
rg -n "raise.*Error|raise.*Exception|RuntimeError|TVMError" tilelang/jit/adapter/wrapper.py --type=py -B 2 -A 2 | head -60

Repository: tile-ai/tilelang

Length of output: 2595


🏁 Script executed:

# Check if there's any validation in the tilelang package that throws on grid dimension 0
rg -n "grid.*== 0|grid_info.*0" tilelang/jit/ --type=py -B 3 -A 3

Repository: tile-ai/tilelang

Length of output: 2298


🏁 Script executed:

# Check if tvm or TVMError is accessible from tilelang package exports
rg -n "TVMError|from tvm import" tilelang/__init__.py --type=py -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 237


🏁 Script executed:

# Look at how the test_tilelang_transform_make_packed_api.py imports and uses TVMError
cat -n testing/python/transform/test_tilelang_transform_make_packed_api.py | head -40

Repository: tile-ai/tilelang

Length of output: 1585


🏁 Script executed:

# Check the runtime validation code to understand what exception is raised
cat -n src/runtime/runtime.cc | sed -n '340,365p'

Repository: tile-ai/tilelang

Length of output: 1366


🏁 Script executed:

# Look at how grid dimensions are validated and what exception is raised
rg -n "issues.push_back.*globalDim" src/runtime/runtime.cc -B 5 -A 10

Repository: tile-ai/tilelang

Length of output: 1630


🏁 Script executed:

# Check what happens with the issues list - how they're converted to exceptions
rg -n "issues" src/runtime/runtime.cc | head -40

Repository: tile-ai/tilelang

Length of output: 2618


🏁 Script executed:

# Verify if tvm.TVMError is the correct exception by looking at how it's used in similar validation tests
cat -n testing/python/transform/test_tilelang_transform_make_packed_api.py | sed -n '170,195p'

Repository: tile-ai/tilelang

Length of output: 813


Use tvm.TVMError instead of the overly broad Exception.

The test currently catches Exception, which could mask unrelated failures. Based on the runtime validation logic, when grid dimensions resolve to zero, a tvm.TVMError is raised with the message "globalDim[0] must be non-zero". This matches the pattern used in other validation tests in the codebase.

Update both test functions to use the specific exception type:

♻️ Suggested improvements

In test function test_issue_1993_dynamic_zero_grid_dim() (lines 51-53):

-    with pytest.raises(Exception):
+    with pytest.raises(tvm.TVMError):
         kernel(out_empty)
         torch.cuda.synchronize()

In test function test_issue_1993_static_zero_grid_dim() (lines 67-69):

-    with pytest.raises(Exception):
+    with pytest.raises(tvm.TVMError):
         kernel(out)
         torch.cuda.synchronize()

Note: You'll need to import tvm in the test file. Add from tilelang import tvm near the top if not already present.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Zero case: should raise an error, not crash with illegal memory access
out_empty = torch.zeros(0, dtype=torch.float32, device="cuda")
with pytest.raises(Exception):
kernel(out_empty)
torch.cuda.synchronize()
# Zero case: should raise an error, not crash with illegal memory access
out_empty = torch.zeros(0, dtype=torch.float32, device="cuda")
with pytest.raises(tvm.TVMError):
kernel(out_empty)
torch.cuda.synchronize()
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 51-51: Do not assert blind exception: Exception

(B017)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/issue/test_tilelang_issue_1993.py` around lines 49 - 53,
Replace the broad Exception catches with the specific tvm.TVMError in both test
functions test_issue_1993_dynamic_zero_grid_dim and
test_issue_1993_static_zero_grid_dim: change the with pytest.raises(Exception):
blocks that wrap kernel(out_empty) and torch.cuda.synchronize() to with
pytest.raises(tvm.TVMError): and add the import statement from tilelang import
tvm near the top of the test file if it’s missing so the tvm symbol is
available.

@LeiWang1999 LeiWang1999 merged commit 1c561f6 into main Mar 30, 2026
5 of 6 checks passed
@LeiWang1999 LeiWang1999 deleted the fix/zero-grid-dim-check-issue-1993 branch April 14, 2026 06:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] T.Kernel with grid dimension=0 silently launches with grid=(1,1,1) instead of being a no-op or reporting any error

1 participant