Skip to content

[NVPTX] Add prefetch tensormap variant #146203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

abhilash1910
Copy link
Contributor

[NVPTX] Add Prefetch tensormap intrinsics
This PR adds prefetch intrinsics with the relevant tensormap_space.

  • Lit tests are added as part of prefetch.ll
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer to the PTX ISA for prefetch intrinsic : Prefetch Tensormap

@durga4github @schwarzschild-radius

@llvmbot
Copy link
Member

llvmbot commented Jun 28, 2025

@llvm/pr-subscribers-llvm-ir

Author: Abhilash Majumder (abhilash1910)

Changes

[NVPTX] Add Prefetch tensormap intrinsics
This PR adds prefetch intrinsics with the relevant tensormap_space.

  • Lit tests are added as part of prefetch.ll
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer to the PTX ISA for prefetch intrinsic : Prefetch Tensormap

@durga4github @schwarzschild-radius


Full diff: https://github.com/llvm/llvm-project/pull/146203.diff

4 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+7-1)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+4)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+9)
  • (modified) llvm/test/CodeGen/NVPTX/prefetch.ll (+30)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 11017fe4e01b4..ca951811b73dd 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -971,6 +971,9 @@ Syntax:
   declare void  @llvm.nvvm.prefetch.L1(ptr %ptr)
   declare void  @llvm.nvvm.prefetch.L2(ptr %ptr)
   
+  declare void  @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
+  declare void  @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4) %const_ptr)
+  
   declare void  @llvm.nvvm.prefetch.global.L2.evict.normal(ptr addrspace(1) %global_ptr)
   declare void  @llvm.nvvm.prefetch.global.L2.evict.last(ptr addrspace(1) %global_ptr)
 
@@ -983,7 +986,10 @@ The '``@llvm.nvvm.prefetch.*``' and '``@llvm.nvvm.prefetchu.*``' intrinsic
 correspond to the '``prefetch.*``;' and '``prefetchu.*``' family of PTX instructions. 
 The '``prefetch.*``' instructions bring the cache line containing the
 specified address in global or local memory address space into the 
-specified cache level (L1 or L2). The '`prefetchu.*``' instruction brings the cache line 
+specified cache level (L1 or L2). If the '``.tensormap``' qualifier is specified then the 
+prefetch instruction brings the cache line containing the specified address in the 
+'``.const``' or '``.param memory``' state space for subsequent use by the '``cp.async.bulk.tensor``' 
+instruction.The '`prefetchu.*``' instruction brings the cache line 
 containing the specified generic address into the specified uniform cache level.
 If no address space is specified, it is assumed to be generic address. The intrinsic 
 uses and eviction priority which can be accessed by the '``.level::eviction_priority``' modifier.
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 0375f29ad8906..0678bba51e4a3 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -137,6 +137,7 @@
 
 def llvm_global_ptr_ty  : LLVMQualPointerType<1>;         // (global)ptr
 def llvm_shared_ptr_ty  : LLVMQualPointerType<3>;         // (shared)ptr
+def llvm_constant_ptr_ty: LLVMQualPointerType<4>;         // (const)ptr
 def llvm_local_ptr_ty   : LLVMQualPointerType<5>;         // (local)ptr
 def llvm_tmem_ptr_ty    : LLVMQualPointerType<6>;         // (tensor memory)ptr
 def llvm_shared_cluster_ptr_ty : LLVMQualPointerType<7>;  // (shared_cluster)ptr
@@ -2092,6 +2093,9 @@ let IntrProperties = [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<
     def int_nvvm_prefetch_local_ # level : Intrinsic<[], [llvm_local_ptr_ty]>;
   }
 
+  def int_nvvm_prefetch_tensormap_p0: DefaultAttrsIntrinsic<[], [llvm_ptr_ty]>;
+  def int_nvvm_prefetch_tensormap_p4 : DefaultAttrsIntrinsic<[], [llvm_constant_ptr_ty]>;
+  
   foreach eviction_priority = ["evict_normal", "evict_last"] in
     def int_nvvm_prefetch_global_L2_ # eviction_priority : Intrinsic<[], [llvm_global_ptr_ty]>;
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index cc1fd027d8515..8afc7063c363a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -760,6 +760,15 @@ def PREFETCH_GLOBAL_L1 : PREFETCH_INTRS<"prefetch.global.L1">;
 def PREFETCH_LOCAL_L1  : PREFETCH_INTRS<"prefetch.local.L1">;
 def PREFETCH_GLOBAL_L2 : PREFETCH_INTRS<"prefetch.global.L2">;
 def PREFETCH_LOCAL_L2 : PREFETCH_INTRS<"prefetch.local.L2">;
+def PREFETCH_CONST_TENSORMAP        : BasicNVPTXInst<(outs), (ins ADDR:$addr),
+                                      "prefetch.const.tensormap",
+                                      [(int_nvvm_prefetch_tensormap_p4 addr:$addr)]>,
+                                      Requires<[hasPTX<80>, hasSM<90>]>;
+                                      
+def PREFETCH_GENERIC_TENSORMAP      : BasicNVPTXInst<(outs), (ins ADDR:$addr),
+                                      "prefetch.tensormap",
+                                      [(int_nvvm_prefetch_tensormap_p0 addr:$addr)]>,
+                                      Requires<[hasPTX<80>, hasSM<90>]>;
 
 def PREFETCH_GLOBAL_L2_EVICT_NORMAL : BasicNVPTXInst<(outs), (ins ADDR:$addr),
                                       "prefetch.global.L2::evict_normal",
diff --git a/llvm/test/CodeGen/NVPTX/prefetch.ll b/llvm/test/CodeGen/NVPTX/prefetch.ll
index a64e4fe7a508e..b63155ff49185 100644
--- a/llvm/test/CodeGen/NVPTX/prefetch.ll
+++ b/llvm/test/CodeGen/NVPTX/prefetch.ll
@@ -12,6 +12,9 @@ declare void  @llvm.nvvm.prefetch.local.L2(ptr addrspace(5) %local_ptr)
 declare void  @llvm.nvvm.prefetch.L1(ptr %ptr)
 declare void  @llvm.nvvm.prefetch.L2(ptr %ptr)
 
+declare void  @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
+declare void  @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4) %const_ptr)
+
 declare void  @llvm.nvvm.prefetch.global.L2.evict.normal(ptr addrspace(1) %global_ptr)
 declare void  @llvm.nvvm.prefetch.global.L2.evict.last(ptr addrspace(1) %global_ptr)
 
@@ -67,6 +70,33 @@ define void @prefetch_(ptr %ptr) {
   ret void
 }
 
+
+define void @prefetch_generic_tensormap(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetch_generic_tensormap(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [prefetch_generic_tensormap_param_0];
+; CHECK-PTX64-NEXT:    prefetch.tensormap [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
+  ret void
+}
+
+define void @prefetch_const_tensormap(ptr addrspace(4) %const_ptr) {
+; CHECK-PTX64-LABEL: prefetch_const_tensormap(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [prefetch_const_tensormap_param_0];
+; CHECK-PTX64-NEXT:    prefetch.const.tensormap [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4)  %const_ptr)
+  ret void
+}
+
 define void @prefetchu_l1(ptr %ptr) {
 ; CHECK-PTX64-LABEL: prefetchu_l1(
 ; CHECK-PTX64:       {

@llvmbot
Copy link
Member

llvmbot commented Jun 28, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Abhilash Majumder (abhilash1910)

Changes

[NVPTX] Add Prefetch tensormap intrinsics
This PR adds prefetch intrinsics with the relevant tensormap_space.

  • Lit tests are added as part of prefetch.ll
  • The generated PTX is verified with a 12.3 ptxas executable.
  • Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer to the PTX ISA for prefetch intrinsic : Prefetch Tensormap

@durga4github @schwarzschild-radius


Full diff: https://github.com/llvm/llvm-project/pull/146203.diff

4 Files Affected:

  • (modified) llvm/docs/NVPTXUsage.rst (+7-1)
  • (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+4)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+9)
  • (modified) llvm/test/CodeGen/NVPTX/prefetch.ll (+30)
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 11017fe4e01b4..ca951811b73dd 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -971,6 +971,9 @@ Syntax:
   declare void  @llvm.nvvm.prefetch.L1(ptr %ptr)
   declare void  @llvm.nvvm.prefetch.L2(ptr %ptr)
   
+  declare void  @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
+  declare void  @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4) %const_ptr)
+  
   declare void  @llvm.nvvm.prefetch.global.L2.evict.normal(ptr addrspace(1) %global_ptr)
   declare void  @llvm.nvvm.prefetch.global.L2.evict.last(ptr addrspace(1) %global_ptr)
 
@@ -983,7 +986,10 @@ The '``@llvm.nvvm.prefetch.*``' and '``@llvm.nvvm.prefetchu.*``' intrinsic
 correspond to the '``prefetch.*``;' and '``prefetchu.*``' family of PTX instructions. 
 The '``prefetch.*``' instructions bring the cache line containing the
 specified address in global or local memory address space into the 
-specified cache level (L1 or L2). The '`prefetchu.*``' instruction brings the cache line 
+specified cache level (L1 or L2). If the '``.tensormap``' qualifier is specified then the 
+prefetch instruction brings the cache line containing the specified address in the 
+'``.const``' or '``.param memory``' state space for subsequent use by the '``cp.async.bulk.tensor``' 
+instruction.The '`prefetchu.*``' instruction brings the cache line 
 containing the specified generic address into the specified uniform cache level.
 If no address space is specified, it is assumed to be generic address. The intrinsic 
 uses and eviction priority which can be accessed by the '``.level::eviction_priority``' modifier.
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 0375f29ad8906..0678bba51e4a3 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -137,6 +137,7 @@
 
 def llvm_global_ptr_ty  : LLVMQualPointerType<1>;         // (global)ptr
 def llvm_shared_ptr_ty  : LLVMQualPointerType<3>;         // (shared)ptr
+def llvm_constant_ptr_ty: LLVMQualPointerType<4>;         // (const)ptr
 def llvm_local_ptr_ty   : LLVMQualPointerType<5>;         // (local)ptr
 def llvm_tmem_ptr_ty    : LLVMQualPointerType<6>;         // (tensor memory)ptr
 def llvm_shared_cluster_ptr_ty : LLVMQualPointerType<7>;  // (shared_cluster)ptr
@@ -2092,6 +2093,9 @@ let IntrProperties = [IntrArgMemOnly, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<
     def int_nvvm_prefetch_local_ # level : Intrinsic<[], [llvm_local_ptr_ty]>;
   }
 
+  def int_nvvm_prefetch_tensormap_p0: DefaultAttrsIntrinsic<[], [llvm_ptr_ty]>;
+  def int_nvvm_prefetch_tensormap_p4 : DefaultAttrsIntrinsic<[], [llvm_constant_ptr_ty]>;
+  
   foreach eviction_priority = ["evict_normal", "evict_last"] in
     def int_nvvm_prefetch_global_L2_ # eviction_priority : Intrinsic<[], [llvm_global_ptr_ty]>;
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index cc1fd027d8515..8afc7063c363a 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -760,6 +760,15 @@ def PREFETCH_GLOBAL_L1 : PREFETCH_INTRS<"prefetch.global.L1">;
 def PREFETCH_LOCAL_L1  : PREFETCH_INTRS<"prefetch.local.L1">;
 def PREFETCH_GLOBAL_L2 : PREFETCH_INTRS<"prefetch.global.L2">;
 def PREFETCH_LOCAL_L2 : PREFETCH_INTRS<"prefetch.local.L2">;
+def PREFETCH_CONST_TENSORMAP        : BasicNVPTXInst<(outs), (ins ADDR:$addr),
+                                      "prefetch.const.tensormap",
+                                      [(int_nvvm_prefetch_tensormap_p4 addr:$addr)]>,
+                                      Requires<[hasPTX<80>, hasSM<90>]>;
+                                      
+def PREFETCH_GENERIC_TENSORMAP      : BasicNVPTXInst<(outs), (ins ADDR:$addr),
+                                      "prefetch.tensormap",
+                                      [(int_nvvm_prefetch_tensormap_p0 addr:$addr)]>,
+                                      Requires<[hasPTX<80>, hasSM<90>]>;
 
 def PREFETCH_GLOBAL_L2_EVICT_NORMAL : BasicNVPTXInst<(outs), (ins ADDR:$addr),
                                       "prefetch.global.L2::evict_normal",
diff --git a/llvm/test/CodeGen/NVPTX/prefetch.ll b/llvm/test/CodeGen/NVPTX/prefetch.ll
index a64e4fe7a508e..b63155ff49185 100644
--- a/llvm/test/CodeGen/NVPTX/prefetch.ll
+++ b/llvm/test/CodeGen/NVPTX/prefetch.ll
@@ -12,6 +12,9 @@ declare void  @llvm.nvvm.prefetch.local.L2(ptr addrspace(5) %local_ptr)
 declare void  @llvm.nvvm.prefetch.L1(ptr %ptr)
 declare void  @llvm.nvvm.prefetch.L2(ptr %ptr)
 
+declare void  @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
+declare void  @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4) %const_ptr)
+
 declare void  @llvm.nvvm.prefetch.global.L2.evict.normal(ptr addrspace(1) %global_ptr)
 declare void  @llvm.nvvm.prefetch.global.L2.evict.last(ptr addrspace(1) %global_ptr)
 
@@ -67,6 +70,33 @@ define void @prefetch_(ptr %ptr) {
   ret void
 }
 
+
+define void @prefetch_generic_tensormap(ptr %ptr) {
+; CHECK-PTX64-LABEL: prefetch_generic_tensormap(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [prefetch_generic_tensormap_param_0];
+; CHECK-PTX64-NEXT:    prefetch.tensormap [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.tensormap.p0(ptr %ptr)
+  ret void
+}
+
+define void @prefetch_const_tensormap(ptr addrspace(4) %const_ptr) {
+; CHECK-PTX64-LABEL: prefetch_const_tensormap(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [prefetch_const_tensormap_param_0];
+; CHECK-PTX64-NEXT:    prefetch.const.tensormap [%rd1];
+; CHECK-PTX64-NEXT:    ret;
+  tail call void @llvm.nvvm.prefetch.tensormap.p4(ptr addrspace(4)  %const_ptr)
+  ret void
+}
+
 define void @prefetchu_l1(ptr %ptr) {
 ; CHECK-PTX64-LABEL: prefetchu_l1(
 ; CHECK-PTX64:       {

Copy link
Contributor

@durga4github durga4github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest revision lgtm

@durga4github durga4github requested a review from AlexMaclean July 1, 2025 11:19
Comment on lines 2096 to 2097
def int_nvvm_prefetch_generic_tensormap : DefaultAttrsIntrinsic<[], [llvm_ptr_ty]>;
def int_nvvm_prefetch_const_tensormap : DefaultAttrsIntrinsic<[], [llvm_constant_ptr_ty]>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would lean towards implementing an intrinsic such as this as an overloaded intrinsic, instead of creating several different intrinsics each with the addrspace encoded in the name. In addition to the spaces you've specified it looks like the PTX instruction supports param space as well, it's unwieldy to have a separate intrinsic for each of these (especially if we were to add more supported AS in the future). A single overloaded intrinsic would be simpler to work with, such as when propagating the AS into it in infer-addrspace, and extend in the future. While it's true that an overloaded intrinsic would allow more invalid IR to be specified, this is true for all overloaded intrinsics (ie. memset to constant AS) and generally not a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alex could you help with a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants