Experimental FP Linear Implementation with NV cooperative matrix 2 extension by HarryHu-art · Pull Request #17501 · pytorch/executorch

HarryHu-art · 2026-02-17T19:36:58Z

Summary:
Experimental FP Linear Implementation with NV cooperative matrix 2

 buck run  fbcode/mode/win //xplat/executorch/backends/vulkan/test/custom_ops:test_fp_linear
>>
File changed: fbcode//executorch/backends/vulkan/runtime/gen_vulkan_spv.py
File changed: fbcode//executorch/backends/vulkan/runtime/graph/ops/glsl/pack_fp_linear_weight.yaml
File changed: fbcode//executorch/backends/vulkan/runtime/graph/ops/impl/LinearExperimental.cpp
15 additional file change events
Buck UI: https://www.internalfb.com/buck2/34f0710d-d349-4cba-9e35-10926968dd39
Network: Up: 0B  Down: 0B
Command: run.
Time elapsed: 19.0s
BUILD SUCCEEDED - starting your binary

=== Compute Shader Performance Benchmark ===
FP32/FP16 Linear Layer Benchmark
----------------------------------------------------------------------

=== Cooperative Matrix Properties ===
Loader Message 0 Inserted device layer "VK_LAYER_KHRONOS_validation" (C:\VulkanSDK\1.4.321.1\Bin\.\VkLayer_khronos_validation.dll)
Loader Message 0 Inserted device layer "VK_LAYER_NV_present" (C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll)
Loader Message 0 Inserted device layer "VK_LAYER_NV_optimus" (C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll)
Loader Message 0 vkCreateDevice layer callstack setup to:
Loader Message 0    <Application>
Loader Message 0      ||
Loader Message 0    <Loader>
Loader Message 0      ||
Loader Message 0    VK_LAYER_NV_optimus
Loader Message 0            Type: Implicit
Loader Message 0            Enabled By: Implicit Layer
Loader Message 0                Disable Env Var:  DISABLE_LAYER_NV_OPTIMUS_1
Loader Message 0            Manifest: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\nv-vk64.json
Loader Message 0            Library:  C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll
Loader Message 0      ||
Loader Message 0    VK_LAYER_NV_present
Loader Message 0            Type: Implicit
Loader Message 0            Enabled By: Implicit Layer
Loader Message 0                Disable Env Var:  DISABLE_LAYER_NV_PRESENT_1
Loader Message 0            Manifest: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\nv-vk64.json
Loader Message 0            Library:  C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll
Loader Message 0      ||
Loader Message 0    VK_LAYER_KHRONOS_validation
Loader Message 0            Type: Explicit
Loader Message 0            Enabled By: By the Application
Loader Message 0            Manifest: C:\VulkanSDK\1.4.321.1\Bin\VkLayer_khronos_validation.json
Loader Message 0            Library:  C:\VulkanSDK\1.4.321.1\Bin\.\VkLayer_khronos_validation.dll
Loader Message 0      ||
Loader Message 0    <Device>
Loader Message 0        Using "NVIDIA GeForce RTX 5080" with driver: "C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll"
Validation 0 vkCreateImage(): The following VkImageCreateInfo returned VK_ERROR_FORMAT_NOT_SUPPORTED when calling vkGetPhysicalDeviceImageFormatProperties2
format (VK_FORMAT_R32G32B32A32_SFLOAT)
type (VK_IMAGE_TYPE_3D)
tiling (VK_IMAGE_TILING_LINEAR)
usage (VK_IMAGE_USAGE_SAMPLED_BIT|VK_IMAGE_USAGE_STORAGE_BIT)
flags (VkImageCreateFlags(0))
VkImageCreateInfo::pNext is NULL.
The Vulkan spec states: Each of the following values (as described in Image Creation Limits) must not be undefined : imageCreateMaxMipLevels, imageCreateMaxArrayLayers, imageCreateMaxExtent, and imageCreateSampleCounts (https://vulkan.lunarg.com/doc/view/1.4.321.1/windows/antora/spec/latest/chapters/resources.html#VUID-VkImageCreateInfo-imageCreateMaxMipLevels-02251)
Found 15 cooperative matrix configurations:
----------------------------------------------------------------------
  #  |   M  |   N  |   K  | A Type  | B Type  | C Type  | R Type  | Scope
----------------------------------------------------------------------
   0 |   16 |   16 |   16 | float16 | float16 | float16 | float16 | Subgroup
   1 |   16 |    8 |   16 | float16 | float16 | float16 | float16 | Subgroup
   2 |   16 |    8 |    8 | float16 | float16 | float16 | float16 | Subgroup
   3 |   16 |   16 |   16 | float16 | float16 | float32 | float32 | Subgroup
   4 |   16 |    8 |   16 | float16 | float16 | float32 | float32 | Subgroup
   5 |   16 |    8 |    8 | float16 | float16 | float32 | float32 | Subgroup
   6 |   16 |   16 |   32 | uint8   | uint8   | uint32  | uint32  | Subgroup
   7 |   16 |   16 |   32 | int8    | int8    | int32   | int32   | Subgroup
   8 |   16 |    8 |   32 | uint8   | uint8   | uint32  | uint32  | Subgroup
   9 |   16 |    8 |   32 | int8    | int8    | int32   | int32   | Subgroup
  10 |   16 |   16 |   16 | unknown | unknown | float32 | float32 | Subgroup
  11 |   16 |   16 |   32 | unknown | unknown | float16 | float16 | Subgroup
  12 |   16 |   16 |   32 | unknown | unknown | float32 | float32 | Subgroup
  13 |   16 |   16 |   32 | unknown | unknown | float16 | float16 | Subgroup
  14 |   16 |   16 |   32 | unknown | unknown | float32 | float32 | Subgroup
----------------------------------------------------------------------

Configurations with float32 A, B, C types:

Configurations with float16 A/B, float32 C (mixed precision):
  M=16, N=16, K=16, Scope=Subgroup
  M=16, N=8, K=16, Scope=Subgroup
  M=16, N=8, K=8, Scope=Subgroup
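
The two summary lists above read as filters over the 15 reported configurations: none has float32 A/B inputs, and three pair float16 inputs with a float32 accumulator. A minimal Python sketch of that filtering, with the rows transcribed from the table in this log; the `select` helper is hypothetical and is not the utility added by this PR:

```python
# Rows transcribed from the configuration table printed in the log:
# (M, N, K, A type, B type, C type, result type). All rows have Subgroup scope.
CONFIGS = [
    (16, 16, 16, "float16", "float16", "float16", "float16"),
    (16, 8, 16, "float16", "float16", "float16", "float16"),
    (16, 8, 8, "float16", "float16", "float16", "float16"),
    (16, 16, 16, "float16", "float16", "float32", "float32"),
    (16, 8, 16, "float16", "float16", "float32", "float32"),
    (16, 8, 8, "float16", "float16", "float32", "float32"),
    (16, 16, 32, "uint8", "uint8", "uint32", "uint32"),
    (16, 16, 32, "int8", "int8", "int32", "int32"),
    (16, 8, 32, "uint8", "uint8", "uint32", "uint32"),
    (16, 8, 32, "int8", "int8", "int32", "int32"),
    (16, 16, 16, "unknown", "unknown", "float32", "float32"),
    (16, 16, 32, "unknown", "unknown", "float16", "float16"),
    (16, 16, 32, "unknown", "unknown", "float32", "float32"),
    (16, 16, 32, "unknown", "unknown", "float16", "float16"),
    (16, 16, 32, "unknown", "unknown", "float32", "float32"),
]


def select(configs, ab_type, c_type):
    """Return (M, N, K) for rows whose A and B types are ab_type and whose C type is c_type."""
    return [
        (m, n, k)
        for (m, n, k, a, b, c, _result) in configs
        if a == ab_type and b == ab_type and c == c_type
    ]


full_fp32 = select(CONFIGS, "float32", "float32")
mixed = select(CONFIGS, "float16", "float32")

print(full_fp32)  # [] : the device reports no float32 A/B configuration
print(mixed)      # [(16, 16, 16), (16, 8, 16), (16, 8, 8)]
```

Rows 10 to 14 print `unknown` for the A and B types, so this sketch cannot classify them.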

Test: ACCU  B=4  I=128  O=128  Buf  fp16+bias L

input_tensor Data:
  Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
  Total elements: 512
  Data (first 64 elements): [-0.250732, 0.592773, 0.901367, -0.632812, 0.463867, 0.559082, 0.197266, 0.193604, -0.687500, -0.108276, -0.687988, -0.799805, -0.883789, -0.081482, 0.731934, -0.332520, 0.202148, -0.713867, 0.416016, 0.301758, -0.958496, -0.886719, 0.939453, 0.443848, 0.664551, 0.876953, -0.575195, -0.998047, -0.636230, 0.984375, -0.632812, 0.234863, -0.391357, 0.223267, 0.049500, -0.985840, -0.136108, -0.953613, -0.417480, 0.049530, 0.223633, -0.200195, -0.720703, -0.906250, -0.415527, 0.947266, -0.267090, -0.534180, -0.087830, -0.818359, 0.570312, 0.236694, -0.600586, -0.234985, 0.028458, 0.966309, 0.184814, -0.066467, -0.906738, 0.719727, 0.215088, 0.360596, -0.658691, -0.098999, ... (448 more)] 
  Statistics: min=0.229682, max=1.468703, mean=0.922048, sum=472.088684

weight_tensor Data:
  Type: ValueSpec(type=Tensor, sizes=[128, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
  Total elements: 16384
  Data (first 64 elements): [-0.769531, -0.006275, 0.218018, -0.793945, -0.732910, -0.702148, -0.518555, -0.655273, -0.345703, 0.622070, 0.718262, -0.850098, 0.332031, -0.909668, 0.082275, 0.224243, -0.941895, 0.189575, 0.467285, -0.498535, -0.210083, -0.705078, 0.604004, -0.977051, -0.490967, -0.063721, -0.885742, 0.909180, 0.732910, 0.111511, -0.557617, 0.299316, -0.189941, 0.161743, -0.367676, -0.662109, -0.846191, -0.168213, 0.686035, 0.305908, 0.697754, 0.241089, 0.942871, -0.192383, -0.229126, 0.747070, 0.908691, 0.110107, -0.108459, -0.142090, 0.339355, -0.720215, -0.834961, -0.539062, 0.793945, -0.605469, -0.403809, 0.596680, -0.475342, -0.335205, -0.989258, -0.157715, 0.086365, -0.110352, ... (16320 more)]
  Statistics: min=0.013550, max=1.468764, mean=0.913034, sum=14959.146484

bias_tensor Data:
  Type: ValueSpec(type=Tensor, sizes=[128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
  Total elements: 128
  Data (first 64 elements): [0.669434, -0.134888, -0.790039, -0.936035, 0.489258, 0.255615, -0.278809, -0.630859, -0.281250, -0.407227, 0.218384, -0.485840, -0.212402, 0.359619, -0.181763, -0.434814, 0.019791, 0.833496, 0.420166, -0.583496, 0.920898, 0.262939, -0.086731, 0.597168, -0.144653, 0.065247, -0.772949, -0.650391, -0.563965, -0.645020, 0.914551, 0.467285, 0.886230, -0.200317, 0.763184, 0.024109, 0.292725, -0.917969, -0.572266, -0.051605, 0.273438, -0.978027, -0.721680, 0.602051, -0.082581, -0.718262, 0.747559, -0.410889, -0.482910, 0.708496, 0.329590, 0.599609, 0.725098, -0.226318, -0.702148, 0.677734, 0.125854, -0.276367, -0.681641, 0.816406, -0.653809, -0.542480, -0.791504, -0.032104, ... (64 more)] 
  Statistics: min=0.289590, max=1.467422, mean=0.940261, sum=120.353394
Executing 1 test cases for FPLinear
----------------------------------------------------------------------
==================== Shared Object List ====================
   idx               sizes                   users
     0                  16                    [7,]
==================== Value List ============================
   idx      type               sizes node_type  storage_bytes    so_idx
     0    TENSOR            [4, 128]     INPUT           1024
     1   STAGING
     2 TENSORREF          [128, 128]   PREPACK
     3 TENSORREF               [128]   PREPACK
     4       INT
     5    TENSOR            [4, 128]    OUTPUT           1024
     6    TENSOR          [128, 128]   PREPACK          32768
     7    TENSOR                  []                       16         0
     8    TENSOR               [128]   PREPACK            256
     9   STAGING
==================== Prepack Node List =====================
   idx                     shader_name    tref  packed
     0pack_fp_linear_weight_buffer_half       2       6
     1        nchw_to_buffer_half_half       3       8
==================== Execute Node List =====================
   idx                     shader_name                READ_arg               WRITE_arg
     0       nchw_to_buffer_half_float                    [1,]                    [0,]
     1        linear_tiled_nv_cm2_half                [0,6,8,]                    [5,]
     2       buffer_to_nchw_half_float                    [5,]                    [9,]

Output[0] Data:
  Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=ZEROS)
  Total elements: 512
  Data (first 20 elements): [-1.498047, -3.060547, -0.001218, -2.798828, 1.934570, 2.558594, 3.414062, 5.820312, -2.109375, -3.806641, 3.628906, 0.022491, -1.992188, -1.054688, 0.149658, -8.593750, -0.514648, 8.828125, 3.468750, -1.464844, ... (492 more)]
  Statistics: min=0.101169, max=1.581347, mean=1.030442, sum=527.586243

Output[0] (ref) Data:
  Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=ZEROS)
  Total elements: 512
  Data (first 20 elements): [-1.499553, -3.061874, -0.002150, -2.799306, 1.934955, 2.555415, 3.412702, 5.819950, -2.108622, -3.801892, 3.630069, 0.021923, -1.991505, -1.054845, 0.149873, -8.598531, -0.516323, 8.826857, 3.467027, -1.466808, ... (492 more)]
  Statistics: min=0.101169, max=1.581347, mean=1.030442, sum=527.586243
Correctness validation PASSED
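
As a rough cross-check of the PASSED result, the 20 printed output elements can be compared against the reference values by hand. The sketch below is an illustration only: the tolerances `ATOL` and `RTOL` are assumptions chosen for the example, not the thresholds the test harness actually uses, and only the 20 elements shown in the log are covered.

```python
# First 20 elements of Output[0] and Output[0] (ref), copied from the log.
out = [-1.498047, -3.060547, -0.001218, -2.798828, 1.934570, 2.558594,
       3.414062, 5.820312, -2.109375, -3.806641, 3.628906, 0.022491,
       -1.992188, -1.054688, 0.149658, -8.593750, -0.514648, 8.828125,
       3.468750, -1.464844]
ref = [-1.499553, -3.061874, -0.002150, -2.799306, 1.934955, 2.555415,
       3.412702, 5.819950, -2.108622, -3.801892, 3.630069, 0.021923,
       -1.991505, -1.054845, 0.149873, -8.598531, -0.516323, 8.826857,
       3.467027, -1.466808]

# Assumed tolerances for an fp16 result; the real test may use different values.
ATOL = 1e-2
RTOL = 1e-2

abs_err = [abs(o - r) for o, r in zip(out, ref)]
max_abs_err = max(abs_err)

# Element-wise check in the usual "absolute plus relative" form.
within_tol = all(
    abs(o - r) <= ATOL + RTOL * abs(r) for o, r in zip(out, ref)
)

print(f"max abs error over the printed elements: {max_abs_err:.6f}")
print("within assumed tolerance:", within_tol)
```

On these 20 values the largest absolute difference is a little under 0.005, which is plausible for half-precision storage of outputs with magnitudes up to about 8.8.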
linear_tiled_nv_cm2_half                           ACCU  B=4  I=128  O=128  Buf  fp16+bias L                                 [4x128]           5.920 μs         11.070 GFLOP/s   PASSED
----------------------------------------------------------------------
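
The reported throughput can be reproduced from the problem size and the elapsed time. The arithmetic below reads the time unit as microseconds and is inferred from the printed numbers only, not from the benchmark source: the 11.070 figure matches one operation per multiply-accumulate (B·I·O), whereas the common convention of two FLOPs per multiply-accumulate would give roughly twice that. The bias add is ignored in both counts.

```python
# Problem size and timing taken from the benchmark line in the log.
B, I, O = 4, 128, 128        # batch, input features, output features
elapsed_us = 5.920           # reported time, read as microseconds

macs = B * I * O             # multiply-accumulates in x @ W^T
elapsed_s = elapsed_us * 1e-6

# One operation per multiply-accumulate: matches the printed number.
one_per_mac = macs / elapsed_s / 1e9

# Two FLOPs per multiply-accumulate (a multiply and an add), for comparison.
two_per_mac = 2 * macs / elapsed_s / 1e9

print(f"MACs: {macs}")
print(f"1 op per MAC:    {one_per_mac:.3f} G/s")
print(f"2 FLOPs per MAC: {two_per_mac:.3f} GFLOP/s")
```

At B=4 the problem is small (65,536 multiply-accumulates), so the measured time probably includes a large share of fixed dispatch overhead and says little about peak cooperative-matrix throughput.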

Differential Revision: D91945037

Summary: Add NV cooperative matrix extension if available in the target platform Differential Revision: D92570456

Summary: A float linear op for testing new linear layer implementations. Differential Revision: D91945036

Summary: The new utility file provides a function to query and print cooperative matrix properties supported by the device. The Buck build file `targets.bzl` is updated to include the new source and header files. Reviewed By: SS-JIA Differential Revision: D93009793


pytorch-bot · 2026-02-17T19:37:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17501

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 7 Awaiting Approval

As of commit 2f7daf2 with merge base a24d3e7:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-02-17T19:37:32Z

@HarryHu-art has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91945037.

github-actions · 2026-02-17T19:38:13Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

github-actions · 2026-04-19T01:13:05Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

HarryHu-art added 4 commits February 17, 2026 11:36

Add NV cooperative matrix extension (pytorch#17497)

dae6c2e

Summary: Add NV cooperative matrix extension if available in the target platform Differential Revision: D92570456

FP Linear benchmark + test op

cab3587

Summary: A float linear op for testing new linear layer implementations. Differential Revision: D91945036

HarryHu-art requested a review from SS-JIA as a code owner February 17, 2026 19:36

meta-cla Bot added the CLA Signed label Feb 17, 2026

meta-codesync Bot added fb-exported meta-exported labels Feb 17, 2026

xuyanwen2012 mentioned this pull request Apr 6, 2026

[ET-VK] Add VK_KHR_cooperative_matrix MatMul shaders and benchmark #18726

Closed

github-actions Bot added the stale label Apr 19, 2026

HarryHu-art wants to merge 4 commits into pytorch:main from HarryHu-art:export-D91945037