…tension
Summary:
Experimental FP Linear implementation with NV cooperative matrix 2
```
buck run fbcode/mode/win //xplat/executorch/backends/vulkan/test/custom_ops:test_fp_linear
>>
File changed: fbcode//executorch/backends/vulkan/runtime/gen_vulkan_spv.py
File changed: fbcode//executorch/backends/vulkan/runtime/graph/ops/glsl/pack_fp_linear_weight.yaml
File changed: fbcode//executorch/backends/vulkan/runtime/graph/ops/impl/LinearExperimental.cpp
15 additional file change events
Buck UI: https://www.internalfb.com/buck2/34f0710d-d349-4cba-9e35-10926968dd39
Network: Up: 0B Down: 0B
Command: run.
Time elapsed: 19.0s
BUILD SUCCEEDED - starting your binary
=== Compute Shader Performance Benchmark ===
FP32/FP16 Linear Layer Benchmark
----------------------------------------------------------------------
=== Cooperative Matrix Properties ===
Loader Message 0 Inserted device layer "VK_LAYER_KHRONOS_validation" (C:\VulkanSDK\1.4.321.1\Bin\.\VkLayer_khronos_validation.dll)
Loader Message 0 Inserted device layer "VK_LAYER_NV_present" (C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll)
Loader Message 0 Inserted device layer "VK_LAYER_NV_optimus" (C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll)
Loader Message 0 vkCreateDevice layer callstack setup to:
Loader Message 0 <Application>
Loader Message 0 ||
Loader Message 0 <Loader>
Loader Message 0 ||
Loader Message 0 VK_LAYER_NV_optimus
Loader Message 0 Type: Implicit
Loader Message 0 Enabled By: Implicit Layer
Loader Message 0 Disable Env Var: DISABLE_LAYER_NV_OPTIMUS_1
Loader Message 0 Manifest: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\nv-vk64.json
Loader Message 0 Library: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll
Loader Message 0 ||
Loader Message 0 VK_LAYER_NV_present
Loader Message 0 Type: Implicit
Loader Message 0 Enabled By: Implicit Layer
Loader Message 0 Disable Env Var: DISABLE_LAYER_NV_PRESENT_1
Loader Message 0 Manifest: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\nv-vk64.json
Loader Message 0 Library: C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll
Loader Message 0 ||
Loader Message 0 VK_LAYER_KHRONOS_validation
Loader Message 0 Type: Explicit
Loader Message 0 Enabled By: By the Application
Loader Message 0 Manifest: C:\VulkanSDK\1.4.321.1\Bin\VkLayer_khronos_validation.json
Loader Message 0 Library: C:\VulkanSDK\1.4.321.1\Bin\.\VkLayer_khronos_validation.dll
Loader Message 0 ||
Loader Message 0 <Device>
Loader Message 0 Using "NVIDIA GeForce RTX 5080" with driver: "C:\Windows\System32\DriverStore\FileRepository\nvlei.inf_amd64_28d22cc4fdec4a49\.\nvoglv64.dll"
Validation 0 vkCreateImage(): The following VkImageCreateInfo returned VK_ERROR_FORMAT_NOT_SUPPORTED when calling vkGetPhysicalDeviceImageFormatProperties2
format (VK_FORMAT_R32G32B32A32_SFLOAT)
type (VK_IMAGE_TYPE_3D)
tiling (VK_IMAGE_TILING_LINEAR)
usage (VK_IMAGE_USAGE_SAMPLED_BIT|VK_IMAGE_USAGE_STORAGE_BIT)
flags (VkImageCreateFlags(0))
VkImageCreateInfo::pNext is NULL.
The Vulkan spec states: Each of the following values (as described in Image Creation Limits) must not be undefined : imageCreateMaxMipLevels, imageCreateMaxArrayLayers, imageCreateMaxExtent, and imageCreateSampleCounts (https://vulkan.lunarg.com/doc/view/1.4.321.1/windows/antora/spec/latest/chapters/resources.html#VUID-VkImageCreateInfo-imageCreateMaxMipLevels-02251)
Found 15 cooperative matrix configurations:
----------------------------------------------------------------------
# | M | N | K | A Type | B Type | C Type | R Type | Scope
----------------------------------------------------------------------
0 | 16 | 16 | 16 | float16 | float16 | float16 | float16 | Subgroup
1 | 16 | 8 | 16 | float16 | float16 | float16 | float16 | Subgroup
2 | 16 | 8 | 8 | float16 | float16 | float16 | float16 | Subgroup
3 | 16 | 16 | 16 | float16 | float16 | float32 | float32 | Subgroup
4 | 16 | 8 | 16 | float16 | float16 | float32 | float32 | Subgroup
5 | 16 | 8 | 8 | float16 | float16 | float32 | float32 | Subgroup
6 | 16 | 16 | 32 | uint8 | uint8 | uint32 | uint32 | Subgroup
7 | 16 | 16 | 32 | int8 | int8 | int32 | int32 | Subgroup
8 | 16 | 8 | 32 | uint8 | uint8 | uint32 | uint32 | Subgroup
9 | 16 | 8 | 32 | int8 | int8 | int32 | int32 | Subgroup
10 | 16 | 16 | 16 | unknown | unknown | float32 | float32 | Subgroup
11 | 16 | 16 | 32 | unknown | unknown | float16 | float16 | Subgroup
12 | 16 | 16 | 32 | unknown | unknown | float32 | float32 | Subgroup
13 | 16 | 16 | 32 | unknown | unknown | float16 | float16 | Subgroup
14 | 16 | 16 | 32 | unknown | unknown | float32 | float32 | Subgroup
----------------------------------------------------------------------
Configurations with float32 A, B, C types:
Configurations with float16 A/B, float32 C (mixed precision):
M=16, N=16, K=16, Scope=Subgroup
M=16, N=8, K=16, Scope=Subgroup
M=16, N=8, K=8, Scope=Subgroup
Test: ACCU B=4 I=128 O=128 Buf fp16+bias L
input_tensor Data:
Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
Total elements: 512
Data (first 64 elements): [-0.250732, 0.592773, 0.901367, -0.632812, 0.463867, 0.559082, 0.197266, 0.193604, -0.687500, -0.108276, -0.687988, -0.799805, -0.883789, -0.081482, 0.731934, -0.332520, 0.202148, -0.713867, 0.416016, 0.301758, -0.958496, -0.886719, 0.939453, 0.443848, 0.664551, 0.876953, -0.575195, -0.998047, -0.636230, 0.984375, -0.632812, 0.234863, -0.391357, 0.223267, 0.049500, -0.985840, -0.136108, -0.953613, -0.417480, 0.049530, 0.223633, -0.200195, -0.720703, -0.906250, -0.415527, 0.947266, -0.267090, -0.534180, -0.087830, -0.818359, 0.570312, 0.236694, -0.600586, -0.234985, 0.028458, 0.966309, 0.184814, -0.066467, -0.906738, 0.719727, 0.215088, 0.360596, -0.658691, -0.098999, ... (448 more)]
Statistics: min=0.229682, max=1.468703, mean=0.922048, sum=472.088684
weight_tensor Data:
Type: ValueSpec(type=Tensor, sizes=[128, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
Total elements: 16384
Data (first 64 elements): [-0.769531, -0.006275, 0.218018, -0.793945, -0.732910, -0.702148, -0.518555, -0.655273, -0.345703, 0.622070, 0.718262, -0.850098, 0.332031, -0.909668, 0.082275, 0.224243, -0.941895, 0.189575, 0.467285, -0.498535, -0.210083, -0.705078, 0.604004, -0.977051, -0.490967, -0.063721, -0.885742, 0.909180, 0.732910, 0.111511, -0.557617, 0.299316, -0.189941, 0.161743, -0.367676, -0.662109, -0.846191, -0.168213, 0.686035, 0.305908, 0.697754, 0.241089, 0.942871, -0.192383, -0.229126, 0.747070, 0.908691, 0.110107, -0.108459, -0.142090, 0.339355, -0.720215, -0.834961, -0.539062, 0.793945, -0.605469, -0.403809, 0.596680, -0.475342, -0.335205, -0.989258, -0.157715, 0.086365, -0.110352, ... (16320 more)]
Statistics: min=0.013550, max=1.468764, mean=0.913034, sum=14959.146484
bias_tensor Data:
Type: ValueSpec(type=Tensor, sizes=[128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=RANDOM)
Total elements: 128
Data (first 64 elements): [0.669434, -0.134888, -0.790039, -0.936035, 0.489258, 0.255615, -0.278809, -0.630859, -0.281250, -0.407227, 0.218384, -0.485840, -0.212402, 0.359619, -0.181763, -0.434814, 0.019791, 0.833496, 0.420166, -0.583496, 0.920898, 0.262939, -0.086731, 0.597168, -0.144653, 0.065247, -0.772949, -0.650391, -0.563965, -0.645020, 0.914551, 0.467285, 0.886230, -0.200317, 0.763184, 0.024109, 0.292725, -0.917969, -0.572266, -0.051605, 0.273438, -0.978027, -0.721680, 0.602051, -0.082581, -0.718262, 0.747559, -0.410889, -0.482910, 0.708496, 0.329590, 0.599609, 0.725098, -0.226318, -0.702148, 0.677734, 0.125854, -0.276367, -0.681641, 0.816406, -0.653809, -0.542480, -0.791504, -0.032104, ... (64 more)]
Statistics: min=0.289590, max=1.467422, mean=0.940261, sum=120.353394
Executing 1 test cases for FPLinear
----------------------------------------------------------------------
==================== Shared Object List ====================
idx sizes users
0 16 [7,]
==================== Value List ============================
idx type sizes node_type storage_bytes so_idx
0 TENSOR [4, 128] INPUT 1024
1 STAGING
2 TENSORREF [128, 128] PREPACK
3 TENSORREF [128] PREPACK
4 INT
5 TENSOR [4, 128] OUTPUT 1024
6 TENSOR [128, 128] PREPACK 32768
7 TENSOR [] 16 0
8 TENSOR [128] PREPACK 256
9 STAGING
==================== Prepack Node List =====================
idx shader_name tref packed
0 pack_fp_linear_weight_buffer_half 2 6
1 nchw_to_buffer_half_half 3 8
==================== Execute Node List =====================
idx shader_name READ_arg WRITE_arg
0 nchw_to_buffer_half_float [1,] [0,]
1 linear_tiled_nv_cm2_half [0,6,8,] [5,]
2 buffer_to_nchw_half_float [5,] [9,]
Output[0] Data:
Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=ZEROS)
Total elements: 512
Data (first 20 elements): [-1.498047, -3.060547, -0.001218, -2.798828, 1.934570, 2.558594, 3.414062, 5.820312, -2.109375, -3.806641, 3.628906, 0.022491, -1.992188, -1.054688, 0.149658, -8.593750, -0.514648, 8.828125, 3.468750, -1.464844, ... (492 more)]
Statistics: min=0.101169, max=1.581347, mean=1.030442, sum=527.586243
Output[0] (ref) Data:
Type: ValueSpec(type=Tensor, sizes=[4, 128], dtype=half, memory_layout=WidthPacked, storage_type=Buffer, data_gen=ZEROS)
Total elements: 512
Data (first 20 elements): [-1.499553, -3.061874, -0.002150, -2.799306, 1.934955, 2.555415, 3.412702, 5.819950, -2.108622, -3.801892, 3.630069, 0.021923, -1.991505, -1.054845, 0.149873, -8.598531, -0.516323, 8.826857, 3.467027, -1.466808, ... (492 more)]
Statistics: min=0.101169, max=1.581347, mean=1.030442, sum=527.586243
Correctness validation PASSED
linear_tiled_nv_cm2_half ACCU B=4 I=128 O=128 Buf fp16+bias L [4x128] 5.920 μs 11.070 GFLOP/s PASSED
----------------------------------------------------------------------
```
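As a sanity check on the throughput line in the log above, here is a minimal sketch of how a GFLOP/s figure can be derived from the problem size and the measured time. The helper `gflops` and its `ops_per_mac` parameter are hypothetical and are not taken from the benchmark source. The reported 11.070 GFLOP/s is numerically consistent with counting one operation per multiply-accumulate (B x I x O); the more common convention of two FLOPs per multiply-accumulate would give roughly twice that for the same run.

```python
def gflops(batch, in_features, out_features, elapsed_us, ops_per_mac=1):
    """Throughput in GFLOP/s for a linear layer of the given shape.

    ops_per_mac is an assumption: 1 reproduces the figure in the log,
    2 is the usual "multiply + add" convention.
    """
    ops = batch * in_features * out_features * ops_per_mac
    seconds = elapsed_us * 1e-6
    return ops / seconds / 1e9


# Values from the log: B=4, I=128, O=128, 5.920 microseconds.
as_reported = gflops(4, 128, 128, 5.920)
two_flops_per_mac = gflops(4, 128, 128, 5.920, ops_per_mac=2)

print(f"1 op per MAC:  {as_reported:.3f} GFLOP/s")
print(f"2 ops per MAC: {two_flops_per_mac:.3f} GFLOP/s")
```

At this problem size (4x128x128) the run is probably dominated by dispatch overhead, so the number is better read as a smoke test than as a measure of peak cooperative-matrix throughput.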
Differential Revision: D91945037