Skip to content

Conversation

henrylhtsang
Copy link
Contributor

@henrylhtsang henrylhtsang commented May 10, 2025

Stack from ghstack (oldest at bottom):

Motivation:
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2

AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1

with prescreening:

AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Copy link

pytorch-bot bot commented May 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153335

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2e7b8ff with merge base 59c5fff (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

henrylhtsang added a commit that referenced this pull request May 10, 2025
@henrylhtsang henrylhtsang added the topic: not user facing topic category label May 10, 2025
@henrylhtsang henrylhtsang requested review from eellison and mlazos May 12, 2025 20:28
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
henrylhtsang added a commit that referenced this pull request May 13, 2025
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
henrylhtsang added a commit that referenced this pull request May 13, 2025
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
henrylhtsang added a commit that referenced this pull request May 15, 2025
Copy link
Contributor

@eellison eellison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, maybe @ColinPeppler @mlazos as well


log.debug("Before pruning using prescreening timings, %d choices", len(choices))
sorted_choices = sorted(timings.keys(), key=lambda choice: timings[choice])
num_to_keep = max(int(math.sqrt(len(choices)) / 4), 8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be config-able ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be config-able ?

I was worried about adding too many configs, so want to hold on a bit.

# reopen DLL
for choice in choices_to_keep:
if isinstance(choice, CUDATemplateCaller):
choice.bmreq.ensure_dll_loaded()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed ? the dll_loaded will happen elsewhere as well right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed ? the dll_loaded will happen elsewhere as well right ?

This is needed otherwise we will lose out the one we just tuned. I can double check again.

choices_to_prune = OrderedSet(
choice.hash_key()
for pos, choice in enumerate(sorted_choices)
if pos > num_to_keep or timings[choice] == float("inf")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this if pos > num_to_keep ? won't that already be achieved with sorted_choices[:num_to_keep] ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this if pos > num_to_keep ? won't that already be achieved with sorted_choices[:num_to_keep] ?

I want to make sure we didn't add the ones with timings=inf to choices.

if pos > num_to_keep or timings[choice] == float("inf")
)
choices = [
choice for choice in choices if choice.hash_key() not in choices_to_prune
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need choices_to_prune and choices_to_keep ? won't those be mutually exclusive sets ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need choices_to_prune and choices_to_keep ? won't those be mutually exclusive sets ?

they are mutually exclusive. I wrote it this way because I didn't want to check if choice in choices_to_prune many times, which is inefficient. Let me re-check if there is a better way

@staticmethod
def prescreen_choices(
choices: list[ChoiceCaller],
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: return type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: return type

mypy keep telling me to return list[CUDATemplateCaller] instead of list[ChoiceCaller]...

Comment on lines +2388 to +2389
if len(candidates) < 10:
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I guess the logic is, if we have a small enough number of candidates, we don't need to prune ? add comment ?

@staticmethod
def prune_choices_postscreen(
choices: list[ChoiceCaller],
timings: dict[ChoiceCaller, float],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: more explicit timings ? the timings here are a subset of choices, which isnt necessarily clear

@henrylhtsang henrylhtsang requested a review from ColinPeppler May 15, 2025 20:35
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
henrylhtsang added a commit that referenced this pull request May 16, 2025
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
@henrylhtsang
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 19, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
henrylhtsang added a commit that referenced this pull request May 20, 2025
@henrylhtsang
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@yangw-dev
Copy link
Contributor

@pytorchbot revert -m "sorry the pr is failed internally D75155648" -c ghfirst

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request May 22, 2025
…eening (#153335)"

This reverts commit d237629.

Reverted #153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](#153335 (comment)))
@pytorchmergebot
Copy link
Collaborator

@henrylhtsang your PR has been successfully reverted.

@henrylhtsang
Copy link
Contributor Author

What I did wrong was changing the hash key, which can change the precompilation seen choices logic.

…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
…screening"


Motivation: 
By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive.

Observations:
Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time.

I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations.

Logs:
Baseline:
https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2
```
AUTOTUNE mm(2048x2048, 2048x2048)
strides: [2048, 1], [1, 2048]
dtypes: torch.bfloat16, torch.bfloat16
  cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1802 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_2002 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1929 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1997 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cuda_cutlass_gemm_1923 0.0306 ms 95.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
```

with prescreening:
```
AUTOTUNE mm(147456x6144, 6144x2048)
strides: [6144, 1], [2048, 1]
dtypes: torch.bfloat16, torch.bfloat16
  cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.0225 ms 90.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8
  cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4
  cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1
  cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2
  cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8
SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds precompiling for 32 choices
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
henrylhtsang added a commit that referenced this pull request May 23, 2025
@henrylhtsang
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@henrylhtsang
Copy link
Contributor Author

fyi I think I didn't do the hashing correctly...

@github-actions github-actions bot deleted the gh/henrylhtsang/78/head branch June 27, 2025 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-no-td Do not run TD on this PR ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: inductor Reverted topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants