v1.1.0

Latest

@oulgen released this 20 May 23:32
· 188 commits to main since this release
d807139

What's Changed

  • [Pallas] Use TPU-aware synchronize_device across autotuner and testing by @norx1991 in #1944
  • Refactor if-else branches, unblock test_if_new_variable_in_static_range for Pallas TPU by @AmesingFlank in #1935
  • [Pallas] Register pallas_loop_type only when inner loops exist by @norx1991 in #1915
  • [Pallas] Fix emit_pipeline program_id mapping with loop_order reordering by @norx1991 in #1916
  • [Pallas] Add expected-failure test for non-power-of-2 RDIM size by @norx1991 in #1945
  • Update docs: add missing API entries by @choijon5 in #1948
  • [Benchmark] Add compile time measurement to CI benchmarks by @choijon5 in #1952
  • use is_symm_mem_tensor if the API is available by @shunting314 in #1933
  • [Pallas] When not doing tiling for entire kernel, use explicit BlockSpecs instead of None by @AmesingFlank in #1960
  • Fix FROM_BEST_AVAILABLE matching with hl.specialize() after #1883 by @fulvius31 in #1940
  • [Pallas] Fix indexing scalars using SMEM memory space by @AmesingFlank in #1955
  • [Pallas] Add xfail tests for scalar .begin index not collapsing dims by @norx1991 in #1971
  • Skip test_hl_rand_mixed_argument_order on MTIA due to unaligned address crash by @karthickai in #1977
  • Fix _supports_maxnreg() to guard against non-CUDA backends by @karthickai in #1981
  • Skip TestRandomPhiloxParity class on MTIA (#1979) by @karthickai in #1982
  • [Pallas] Fix accessing tensors using index from hl.grid() by @AmesingFlank in #1956
  • [Pallas] Support using traced size-1 tensor as condition predicate, unblocking test_if_arg_indexed_scalar by @AmesingFlank in #1957
  • [Pallas] Fix atomic_add dtype cast and VMEM preload for fori/pipeline launchers by @thcmbs in #1966
  • [Pallas] Use exact RDIM size instead of next-power-of-2 by @norx1991 in #1954
  • [Pallas] Add support for accessing tensors with the pattern of tile.index + offset /.id/.begin/.end by @AmesingFlank in #1968
  • [metal] Reuse Inductor's MetalOverrides for MSL expression emission by @aditvenk in #1853
  • [metal] Add Metal codegen handlers for load, store, and mask_to by @aditvenk in #1854
  • [Pallas] Enable a subset of test_grid tests for Pallas by @AmesingFlank in #1985
  • [Pallas] Add a test for accessing tensors with hl.grid() index + offset by @AmesingFlank in #1988
  • Relax rms_norm example tolerance for Pallas bf16 by @thcmbs in #1983
  • [autotuner] Introduce BenchmarkProvider abstraction for kernel benchmarking by @hinriksnaer in #1928
  • [Pallas] Use HBM BlockSpecs for output-only tensors to save VMEM by @norx1991 in #1984
  • Emit autotune failure summary warnings by @allgather in #1994
  • [Pallas] Remove unused is_device_loop variable in _pallas_index_str by @AmesingFlank in #1990
  • [Pallas] Fix BlockSpecs for 2D tl.grid([m, n]), unblocking test_scalar_access_hl_grid_2d by @AmesingFlank in #1986
  • Reduce some autotuner overhead without changing kernel behavior by @svdrecbd in #1885
  • [compile] Add pre_codegen hook to Backend ABC by @hinriksnaer in #1976
  • Adding AMD Mi350x machines to CI with new labels. by @umechand-amd in #1835
  • [Pallas] Fix accessing tensor via hl.grid() index within a device loop, unblocking test_scalar_access_hl_grid_2d_nested by @AmesingFlank in #1989
  • Support mtia in LocalAutotuneCache by @Hamlin-Li in #1996
  • removed stale comments by @hinriksnaer in #1997
  • Fix invalid default config for kernels with large tensor numel by @fulvius31 in #1839
  • chore: Bump actions/github-script from 8 to 9 by @dependabot[bot] in #2000
  • Fix _benchmark dropping configs that fail compilation by @fulvius31 in #1942
  • [metal] Add MSL AST walker for Python-to-C++ translation by @aditvenk in #1794
  • Enable tensor_descriptor based atomic ops by @ethche in #1953
  • [metal] Add @metal_jit decorator for AST-to-MSL compilation by @aditvenk in #1991
  • Epilogue subtiling: store indexing fix, example, and tuple output support in run_example by @choijon5 in #1907
  • [Compiler] Added backend registry by @hinriksnaer in #1967
  • [metal] Wire @metal_jit into MetalBackend and simplify launcher by @aditvenk in #1992
  • [Pallas] Skip trivial reduction mask when RDIM size equals actual dim by @norx1991 in #1993
  • [Pallas] Fix TPU min_dot_size for matmul autotuning by @norx1991 in #1999
  • chore: Bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #2015
  • Make _get_tile_with_offset_info accept a torch.fx.Node as arg, instead of the entire CodegenState by @AmesingFlank in #2005
  • [metal] Skip tests better to pass in internal CI by @aditvenk in #2016
  • Add to_code() as backend-agnostic alias for to_triton_code() by @norx1991 in #2012
  • [Pallas] Fix fori_loop multi-dim index decomposition with nested loops by @norx1991 in #1917
  • custom config filter by @shunting314 in #1847
  • [Pallas] Reject 64-bit input tensors and fix tiling ZeroDivisionError by @norx1991 in #1950
  • Bump CI libtpu to 0.0.40 by @AmesingFlank in #2017
  • [Pallas] Make test_tensor_access_tile_index_offset more meaningful and reveal codegen issue in Pallas backend by @AmesingFlank in #2006
  • [Pallas] [NFC] Dedup _jnp_dtype_map helper by @thcmbs in #2021
  • [Pallas] Add a plan_tiling pre_codegen pass to make more consistent tiling and indexing decisions by @AmesingFlank in #2007
  • [Compiler] Add reserved_launch_param_names to Backend ABC by @hinriksnaer in #1970
  • [Pallas] Remove non-tiled fallback paths which are no longer used after #2007 by @AmesingFlank in #2026
  • [TPU][Pallas] Add lower bound analytical VMEM estimation and OOM guard for Pallas launchers by @yarongmu-google in #2024
  • [Pallas] Fix codegen for slice indexing when there are squeezed dimensions by @AmesingFlank in #2027
  • [CI] Pin Triton-to-tile-IR to last known-good commit by @norx1991 in #2028
  • [Pallas] Add non-DMA fori_loop fallback for DMA-unaligned inner blocks by @thcmbs in #1969
  • Skip test_matmul_smaller_than_min_dot_size on MTIA (#2030) by @will-cromar in #2030
  • docs: add missing autofunction directives for 9 API functions by @Bodlux in #2036
  • [AMD ROCM] matrix_instr_nonkdim restricted to 16 by @umechand-amd in #2032
  • [Autotuner] Move benchmarking implementation into BenchmarkProvider by @hinriksnaer in #2029
  • [Bug][Autotuner] Autotuner failed to clone mutated input arg by @xiaohongchen1991 in #2042
  • [TPU][Pallas] Fix OOM on large reductions by delegating chunking to Mosaic by @yarongmu-google in #2033
  • Remove ref baseline kernel count for test_clone_with_multiple_views_one_mutated as it depends on PyTorch version by @choijon5 in #2037
  • docs: remove remaining mentions of HELION_USE_AUTOTUNE by @cota in #2038
  • Fix test broken by #2033: use env.backend_name by @norx1991 in #2046
  • Simplify CI: update PyTorch releases to 2.11, use bundled Triton by @choijon5 in #2047
  • [Pallas] Under interpret mode, Use float16 as HALF_DTYPE because bfloat16 is not supported on CPU by @AmesingFlank in #2050
  • [AMD ROCm] Use AMD Triton backend for min_dot_size instead of NVIDIA by @choijon5 in #2048
  • [Pallas] Adjust block size constraints by analyzing subscript exprs on tensors, unblock rms_norm_bwd example and a few tests by @AmesingFlank in #2051
  • ND jagged tile support by @nullplay in #2052
  • New performance dashboard with GitHub Pages deployment by @choijon5 in #2053
  • [Pallas] Fix test_squeeze_slice_access to use code_and_output by @norx1991 in #2057
  • Add Metal test job to CI test matrix by @aditvenk in #1862
  • [Pallas] Add xfail tests for bmm non-divisible reduction by @norx1991 in #2031
  • [Autotuner] Adding LLM-guided search by @choijon5 in #2003
  • [cutedsl] Refactor reductions to use helper methods by @jansel in #2008
  • [cutedsl] Strengthen layout planning pass invariants by @jansel in #2009
  • [cutedsl] Improve dot with epilogue handling by @jansel in #2014
  • [cutedsl] Plan grouped-N matmuls and lower atomic tensor indices by @jansel in #2020
  • [Pallas] Exclude output-only tensors from Pallas pallas_call inputs to improve performance by @norx1991 in #1849
  • [Pallas] Use FakeTensorMode to avoid HBM allocation for output-only tensors by @norx1991 in #2022
  • Add simplified se_block kernel (#989) by @mengluy0125 in #989
  • [Pallas] Fix symbolic offset codegen in TileIndexWithOffsetPattern by @norx1991 in #2068
  • [Pallas] Replace FakeTensorMode wrap with device='meta' for output-only tensors by @norx1991 in #2071
  • [TPU][Pallas] Enable TPU support and fix benchmarking for AOT compilation example by @yarongmu-google in #2059
  • fix misleading benchmarking for fp8 gemm by @shunting314 in #1980
  • add flashinfer allreduce-rmsnorm kernel by @shunting314 in #2063
  • [Autotuner] Make the autotuner robust to InvalidConfig by @bringlein in #2039
  • Deploy new perf dashboard to GitHub Pages by @choijon5 in #2066
  • Add backend-agnostic lane loop APIs to tile strategies by @aditvenk in #1798
  • [Pallas] Process tensor access within external lambdas when adjusting block size constraints by @AmesingFlank in #2073
  • [Pallas] Fix emit_pipeline/fori_loop codegen when multiple inner loops tile the same dim by @norx1991 in #2075
  • use non-interleaved benchmarking for all-reduce-rmsnorm by @shunting314 in #2065
  • Unify dashboard deployment with docs deploy by @choijon5 in #2082
  • [Pallas] Refactor memory space tracking into PallasMemorySpace enum by @norx1991 in #2072
  • Dashboard: restrict Overview/Speedup to main branch, track latency, UI polish by @choijon5 in #2084
  • [Pallas] Treat tile.id subscripts as untileable scalar indices by @norx1991 in #2083
  • [Pallas] Validate pallas_loop_type by @thcmbs in #2055
  • [NFC] [Pallas] Move indexing codegen helpers to pallas/codegen.py by @thcmbs in #2067
  • [Autotuner] Reland LLM-seeded hybrid search (originally #2004) by @choijon5 in #2091
  • Dashboard: fix crash-masking in CI status, split failures (accuracy/run/infra), chart polish by @choijon5 in #2086
  • [Pallas] More robust analysis of tensor reads/writes via FX graph instead of AST, allowing more aggressive output_only optimizations by @AmesingFlank in #2088
  • [Autotuner] Skip temperature for claude-opus-4-7 (HTTP 400) by @choijon5 in #2089
  • [Autotuner] Raise LLM response token budget to fit verbose configs by @choijon5 in #2090
  • Dashboard: dedupe MI325X duplicate entries, rename Runner Failures → No Result with last-seen date by @choijon5 in #2094
  • Dashboard: include cancelled runs so partial artifacts aren't lost to 6h timeout by @choijon5 in #2095
  • Fix hardcoded CUDA device in jagged_dense_bmm example by @norx1991 in #2080
  • [CI] fix 13.2 TileIR ci pipeline by @qelk123 in #2076
  • allow reuse variables across different static loops by @shunting314 in #2081
  • [Pallas] Fix SMEM/VMEM conflict for tensors with mixed access patterns by @norx1991 in #2069
  • Removing MI325 runners from CI by @umechand-amd in #2096
  • Benchmark: Reduce running Tritonbench for each kernel from twice to once. by @choijon5 in #2097
  • [ROCM] Improve ROCm compatibility for distributed kernels and expand ROCm test coverage in distributed test suites by @umechand-amd in #2049
  • [dashboard] temporarily remove --existing-url to rebuild cache by @choijon5 in #2109
  • Pin tritonbench commit in a file. by @umechand-amd in #2106
  • [dashboard] re-enable --existing-url and filter platforms from dispatch workflow by @choijon5 in #2110
  • Fix flaky distributed and torch.compile tests by @choijon5 in #2098
  • [Benchmarking/CI] Prevent hangs in benchmark phase via subprocess + per-config run timeout by @choijon5 in #2111
  • [Autotuner] Seed LFBO surrogate with stage-1 LLM benchmarks in hybrid search by @choijon5 in #2113
  • [dashboard] Fix empty dashboard caused by broken GitHub API query filters by @choijon5 in #2118
  • [Pallas] Host-side padding for non-divisible pl.ds() dimensions by @norx1991 in #2104
  • Update tritonbench.txt by @umechand-amd in #2121
  • Fix flaky test due to spurious NaN in fusion autotune accuracy check by @choijon5 in #2115
  • [Autotuner] Skip subprocess sticky CUDA errors instead of aborting autotune by @choijon5 in #2122
  • [Pallas] Adding a tunable pre_broadcast optimization pass for TPU scratch buffers, improving TPU attention perf by @AmesingFlank in #2103
  • [dashboard] Trigger docs-deploy explicitly from benchmark dispatch by @choijon5 in #2129
  • [Pallas] Fix multi-dim padding overwriting original tensor reference by @norx1991 in #2120
  • [Pallas] Extend padding to fori_loop DMA and emit_pipeline via _record_pad_info by @norx1991 in #2105
  • [Autotuner] Handle PyTorch CUDA OOM as a skippable error instead of aborting autotuning by @ethche in #2130
  • [Pallas] Indirect gather with pluggable strategies by @thcmbs in #2054
  • [lint] upgrade to pyrefly 0.63.1 by @oulgen in #2134
  • [lint] upgrade ruff to 0.15.12 by @oulgen in #2135
  • [Pallas] Add pl.multiple_of alignment hint to pl.ds() offsets by @norx1991 in #2116
  • Reduce measure() overhead when compile-time tracking is disabled by @gmagogsfm in #2139
  • [CI] Enable build and test for XPU by @chuanqi129 in #1327
  • [Autotuner] Catch expected errors during fork precompiler setup instead of aborting by @ethche in #2142
  • Optimize cache key computation overhead by @gmagogsfm in #2144
  • [Pallas] Honor _smem_arg_indices in pipeline launchers by @norx1991 in #2143
  • Update the tritonbench commit to its current ToT, where the segfault fix has landed by @umechand-amd in #2149
  • [Benchmarking] Disable cudagraph for layer_norm-bwd / rms_norm-bwd by @choijon5 in #2127
  • [docs] Add dashboard link, LLM autotuner docs, remove past events by @choijon5 in #2151
  • [Pallas] Per-dim VMEM accounting for gather budget check by @thcmbs in #2137
  • [cute] Enable TestControlFlow by @oulgen in #2136
  • [cute] Add codegen for hl.split, hl.join, and aten.view.dtype; enable test_views.py by @oulgen in #2138
  • [Pallas] Rename pallas_loop_type "default" to "unroll" by @norx1991 in #2155
  • Remove redundant m_i update in example attention kernel by @AmesingFlank in #2156
  • Use integer arithmetic instead of triton.cdiv in launcher by @gmagogsfm in #2146
  • unbreak docs build by @oulgen in #2162
  • [Pallas] Use torch.addmm in matmul_layernorm K-loop by @norx1991 in #2141
  • [cute] Enable bunch of test suites by @oulgen in #2159
  • Minor runpod updates by @jansel in #2163
  • Update AGENTS.md by @jansel in #2164
  • Add cute-verify skill by @jansel in #2165
  • Add scripts/autoreview.py by @jansel in #2166
  • Run codespell from ./lint.sh by @jansel in #2175
  • [cutedsl] Matmul performance prework by @jansel in #2167
  • Small attention optimization: pre-scale q tile with qk_scale by @AmesingFlank in #2157
  • [Pallas] Tighten _check_dma_alignment + make "unroll" tests explicit by @norx1991 in #2158
  • [cute] Bump minimum cute version to 4.5 by @oulgen in #2180
  • Fix compile time measurements by @choijon5 in #2188
  • torch_tpu: update pin to 28d941aec27 by @cota in #1895
  • Remove redundant compile time env by @choijon5 in #2191
  • [Pallas] Emit offset/indices at inner-loop body prologue by @norx1991 in #2181
  • [Autotuner] Fix crash when autotuner_min exceeds max_size by @stmcgovern in #2177
  • Fix attention benchmark accuracy by @choijon5 in #2178
  • [Pallas] Enable se_block tests on TPU + simplify skipIfCudaCapabilityLessThan by @norx1991 in #2131
  • [cute] Implement topk and sort by @oulgen in #2160
  • Fix negative shift by @oulgen in #2185
  • [flaky test] skip register cache test on XPU by @choijon5 in #2192
  • Add fix-pr skill by @jansel in #2189
  • Add offsets kwarg to hl.rand for explicit Philox offsets by @karthickai in #2153
  • [Pallas] Per-tensor pipelining decision in fori_loop and emit_pipeline by @norx1991 in #2093
  • [cute] Implement associative scan by @oulgen in #2161
  • Add helion.from_cache() for FiniteSearch warm-start by @fulvius31 in #2079
  • [cute] Enable test_print by @oulgen in #2186
  • [BoundKernel] added _normalize_config by @hinriksnaer in #2152
  • [cutedsl] Improve CuTe tcgen05 matmul autotuning and direct-store epilogues by @jansel in #2168
  • [cutedsl] Compile CuTe launchers once and harden regression coverage by @jansel in #2169
  • [cutedsl] track tcgen05 per-tile setup and register split by @jansel in #2170
  • [cutedsl] add autotune wall-time budget by @jansel in #2171
  • [cutedsl] split tcgen05 persistent post-loop cleanup by @jansel in #2172
  • [cutedsl] prune dead tcgen05 role scaffolding by @jansel in #2173
  • [cutedsl] simplify tcgen05 layout plan by @jansel in #2174
  • [cutedsl] guard tcgen05 persistent multi-tile at runtime by @jansel in #2193
  • Fix flash attention benchmark CI by @choijon5 in #2207
  • [cute] enable test unroll tuples by @oulgen in #2187
  • [HostFunction] Extract _parse_source from HostFunction.__init__ by @hinriksnaer in #2154
  • Dashboard: latency-as-default-graph, noise muting, platform sync, color fixes by @choijon5 in #2208
  • [cutedsl] split tcgen05 persistent setup into layout + prelude + tile body by @jansel in #2194
  • [cutedsl] add tcgen05 persistent role-block scaffolding + TMA-load tagging by @jansel in #2195
  • [cutedsl] split tcgen05 per-K-iter TMA producer/consumer block by @jansel in #2196
  • [cutedsl] recurse partitioner into K-loop body for tcgen05 TMA producer by @jansel in #2197
  • [cutedsl] split tcgen05 per-K-iter TMA builders into named helpers by @jansel in #2198
  • [cutedsl] split tcgen05 initial-prefetch IF emission into AST helper by @jansel in #2199
  • [cutedsl] dedupe tcgen05 codegen test mocks via _testing helpers by @jansel in #2200
  • [cutedsl] extract tcgen05 multi-tile guard var/message into class constants by @jansel in #2201
  • [cutedsl] consolidate cute reduction branches and tcgen05 autotune narrowing by @jansel in #2202
  • [cutedsl] extract _count_rdim_axes_in_val helper in roll_reduction by @jansel in #2203
  • [cutedsl] narrow tcgen05_num_epi_warps autotune to (4,) to avoid wrong output by @jansel in #2204
  • [cutedsl] reject tcgen05_num_epi_warps != 4 at codegen + diagnose root cause by @jansel in #2205
  • [cutedsl] add tcgen05 role-local-while builder infrastructure (3b-prep-4) by @jansel in #2206
  • Add missing onlyBackends([cute]) by @jansel in #2219
  • [cute] Enable test_indexing by @oulgen in #2210
  • [cute] Add basic autotuning capabilities by @oulgen in #2221
  • [Pallas] Use exprs from AST instead of SymPy exprs when generating loop bounds by @AmesingFlank in #2211
  • [Pallas] When there are data-dependent loop bounds, also use fori_loop instead of unroll by @AmesingFlank in #2212
  • [Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins by @AmesingFlank in #2213
  • [Pallas] Apply tile masks at load time to zero out-of-bounds data by @AmesingFlank in #2214
  • Temporarily disable XPU CI by @jansel in #2270
  • Support HELION_AUTOTUNE_EFFORT=none HELION_FORCE_AUTOTUNE=1 by @oulgen in #2265
  • Retry more flakey job types by @jansel in #2269
  • [cutedsl] split persistent tma producer role by @jansel in #2224
  • [cutedsl] split persistent mma exec role by @jansel in #2225
  • [cutedsl] split persistent epi role by @jansel in #2226
  • [cutedsl] validate tcgen05 persistent z-grid by @jansel in #2227
  • [AOT] Suppress heuristic cache hit messages by default by @choijon5 in #2220
  • [Pallas] Fix DMA scratch buffer offset bug in nested fori_loop codegen by @AmesingFlank in #2217
  • [Pallas] Fix scratch Ref scoping bug in fori_loop/emit_pipeline codegen by @AmesingFlank in #2215
  • [cutedsl] restore scoped persistent pid autotune by @jansel in #2228
  • [cutedsl] Add flat tcgen05 TMA store epilogue by @jansel in #2229
  • [cutedsl] Add persistent tcgen05 TMA store epilogue by @jansel in #2230
  • [cutedsl] Close out tcgen05 TMA store acquire ordering by @jansel in #2231
  • [cutedsl] add guarded CtaGroup.TWO structural codegen by @jansel in #2232
  • [cutedsl] align two-CTA AB pipeline ownership by @jansel in #2233
  • [cutedsl] align two-CTA TMEM setup ordering by @jansel in #2234
  • [cutedsl] align two-CTA scheduler publication by @jansel in #2235
  • [cutedsl] advance guarded two-CTA role-local codegen by @jansel in #2236
  • [cutedsl] omit two-cta shared scheduler loop by @jansel in #2237
  • [cutedsl] Guard tcgen05 omit-shared scalar setup by @jansel in #2238
  • [cutedsl] Adjust two-CTA TMEM teardown by @jansel in #2239
  • [cutedsl] split tma tail capability check by @jansel in #2240
  • Fix tensor descriptor silent fallback for scalar SymInt subscripts by @ethche in #2222
  • [cutedsl] defer two-cta pipeline constructor sync by @jansel in #2241
  • [cutedsl] validate single-tile two-cta runtime by @jansel in #2242
  • [cutedsl] admit non-recycling two-cta tiles by @jansel in #2243
  • [cutedsl] validate shallow-k two-cta direct grid by @jansel in #2244
  • [cutedsl] validate long-k two-cta direct grid by @jansel in #2245
  • [cutedsl] enable CtaGroup.TWO TMA-store epilogue by @jansel in #2246
  • [cutedsl] elide CtaGroup.TWO role schedulers by @jansel in #2247
  • [cutedsl] restore two-cta scheduler recycling by @jansel in #2248
  • Skip grid td xpu by @ethche in #2275
  • [cutedsl] re-enable two-cta autotune search by @jansel in #2249
  • [cutedsl] seed two-cta autotune search by @jansel in #2250
  • [cutedsl] prune two-cta autotune failures by @jansel in #2251
  • [cutedsl] seed two-cta l2 grouping by @jansel in #2252
  • [cutedsl] seed two-cta tensor indexing by @jansel in #2253
  • [cutedsl] trim tcgen05 epilogue barrier by @jansel in #2255
  • [cutedsl] add two-cta pdl markers by @jansel in #2258
  • Temporarily disable pallas CI until upstream torch_tpu is fixed by @jansel in #2281
  • TPU CI: Use PyTorch nightly from 20260502 instead of most recent nightly to unblock CI by @AmesingFlank in #2298
  • [Pallas] Cast bool masks to float before expanding in _mask_to codegen by @AmesingFlank in #2216
  • [Pallas] Skip fp32 fallback for unary transcendentals on TPU by @norx1991 in #2268
  • [Pallas] Add xfail tests for BMM with non-zero K begin by @norx1991 in #2271
  • [Pallas] Fix pre-broadcasting transformation bug when non-broadcast dims exceed PRE_BROADCAST_SIZE by @AmesingFlank in #2223
  • [Pallas] Lower hl.zeros / hl.full to plain jnp.full by @norx1991 in #2278
  • [Pallas] Fix failing scratch shapes asserts due to land-time race when #2278 caused scratch shapes to be re-ordered by @AmesingFlank in #2302
  • [language] Add hl.rand4x for 4-output Philox RNG by @karthickai in #2283
  • Fix failing cutlass lints by @AmesingFlank in #2303
  • [xpu] Disable proton build for XPU by @Stonepia in #2300
  • [Pallas] Use dot_general instead of matmul for Pallas codegen by @AmesingFlank in #2299
  • [Autotuner] Enable autotuner seed configs by @ethche in #2276
  • update torch_tpu pin to a1ef0dd7fa2ffb730995e31953d1b5d316226c96 by @cota in #2316
  • TPU CI: Restore to using latest nightly pytorch by @AmesingFlank in #2320
  • [Pallas] Use jax_export_ignore_forward_compatibility=True when exporting JaxCallable, improving attention perf by @AmesingFlank in #2323
  • [Pallas] Make pallas_pre_broadcast a tunable autotune fragment by @norx1991 in #2324
  • Fix cute CI failures by @jansel in #2325
  • [cute] Pin to official 4.5.0 by @oulgen in #2326
  • [Pallas] Integrate TPU benchmarks into Benchmark Dispatch + dashboard by @norx1991 in #1913
  • remove obsolete _init_tpu_device helper by @thcmbs in #2319
  • Enable cute for error tests by @oulgen in #2339
  • [cutedsl] fix race in scalar store with full-slice subscript by @jansel in #2327
  • [cutedsl] reorder and hoist tcgen05 C-store epilogue by @jansel in #2328
  • [cutedsl] prefetch tcgen05 consumer token, reset UMMA accumulate per tile, seed pid order by @jansel in #2329
  • [cutedsl] add tcgen05 C-store / acc-wait / skip-UMMA / cubin-lineinfo diagnostic knobs by @jansel in #2330
  • [cutedsl] add tcgen05 split-first and store-tail T2R epilogue diagnostics by @jansel in #2331
  • [cutedsl] add tcgen05 module-helper T2R epilogue diagnostics by @jansel in #2332
  • [cutedsl] add tcgen05 role-local bridge codegen and tests by @jansel in #2333
  • [cutedsl] add tcgen05 role-local bridge pipeline, TMA mask, and ownership by @jansel in #2334
  • [cutedsl] add tcgen05 bridge AB acc-advance and acquire diagnostics by @jansel in #2335
  • [cutedsl] add tcgen05 bridge AB wait, phase, and initial-acquire diagnostics by @jansel in #2336
  • [cutedsl] add tcgen05 larger-BN codegen test by @jansel in #2337
  • Enable cute for int64 indexing tests by @oulgen in #2340
  • Enable cute for cache tests by @oulgen in #2341
  • Enable cute for autotune tests by @oulgen in #2342
  • Add reduction support to helion autodiff by @karthickai in #1747
  • Dashboard: only nightly runs populate Overview; manual dispatches stay in Compare by @choijon5 in #2355
  • [Pallas] Don't pipeline tensors read or written outside the inner loop by @norx1991 in #2284
  • [tutorials] pretuned Helion examples by @choijon5 in #2209
  • [docs] Add AOT autotuning documentation by @choijon5 in #2274
  • Enable cute for stack tensor tests by @oulgen in #2349
  • Enable cute for jagged tile tests by @oulgen in #2350
  • Enable cute for epilogue subtiling tests by @oulgen in #2352
  • [Pallas] Align TPU kernel names with GPU dashboard + add compile-time measurement by @norx1991 in #2354
  • [runtime:pallas] consolidate pallas_aliases computation in _pallas_prepare_args by @cota in #2348
  • [docs] Fix broken tutorials/ links in AOT autotuning doc by @norx1991 in #2358
  • Enable tensor_descriptor for static_shapes = False by @ethche in #2356
  • [compile] Introduce KernelCompiler for pipeline orchestration by @hinriksnaer in #2267
  • [Pallas] Default pallas_loop_type to emit_pipeline by @norx1991 in #2321
  • ci: bump TPU jax/jaxlib pin to 0.10.0 by @thcmbs in #2361
  • Dashboard: fix nightly detection and restrict Overview to main branch by @choijon5 in #2363
  • Dashboard: add geomean footer rows to Speedup and Compare tables by @choijon5 in #2364
  • [compile] Add create_reduction_strategy() and adjust_reduction_thread_count to Backend by @hinriksnaer in #2318
  • Dashboard: don't flag manual-only kernels as infra_missing by @norx1991 in #2366
  • Support lowering torch.bmm(..., dtype=), use it for attention to avoid redundant fp32 -> fp16 -> fp32 roundtrip by @AmesingFlank in #2365
  • [cute] Generalize synthetic lane loops and loop-carried accumulator checks by @hinriksnaer in #2347
  • [Pallas] Make per-tile element cap a backend hook, disable for Pallas by @norx1991 in #2282
  • [Pallas] Add larger shapes for attention and matmul_layernorm benchmarks by @norx1991 in #2368
  • Docs: merge AOT Autotuning into Deployment guide; list pretuned kernels under examples. by @choijon5 in #2369
  • Relax RMS pretuned kernel perf wins gate by @choijon5 in #2371
  • [Benchmarking] Default to benchmark subprocess by @choijon5 in #2372
  • [Autotuner] Clear Triton JIT fast-path caches after benchmarks by @ethche in #2367
  • Support dynamic TD guards for container tensors by @ethche in #2370
  • Increase effort for autoreview by @jansel in #2375
  • [Dashboard] Use paired geomeans in comparisons by @choijon5 in #2376
  • update torch_tpu pin to 104763049fe1df6834605fed1cd2b79434ea02d5 by @cota in #2362
  • Use HALF_DTYPE in epilogue subtiling example by @thcmbs in #2343
  • Relax test_pretuned_kernels.py targets by @jansel in #2379
  • [pallas] remove remaining torch_tpu.api usage by @cota in #2345
  • rms_norm: save inv_rms in fp32 to fix bf16 backward by @thcmbs in #2273
  • [cutedsl] Add Tcgen05Strategy/WarpSpec data model and config keys by @jansel in #2380
  • [cutedsl] Drive matmul warp roles from generated warp-spec records by @jansel in #2381
  • [cutedsl] Implement ROLE_LOCAL_WITH_SCHEDULER matmul strategy by @jansel in #2382
  • [cutedsl] Pin output dtype on matmul plan for epilogue tile shape by @jansel in #2383
  • [cutedsl] Add CLC-persistent tile scheduler for matmul by @jansel in #2384
  • [cutedsl] Support cluster_n=2 in matmul lowering by @jansel in #2385
  • [cutedsl] Add unary epilogue chain analyzer and splicing by @jansel in #2386
  • [cutedsl] Fuse auxiliary tensor loads in epilogue chains by @jansel in #2387
  • [cutedsl] Enable cluster_n=2 under role-local scheduler by @jansel in #2388
  • [cutedsl] Gate cluster_m=2 search by wave quantization and broaden epi-fusion shapes by @jansel in #2389
  • [cutedsl] Add A/B SMEM and L2 scheduler swizzle controls by @jansel in #2390
  • Add _gelu_tanh_approx op by @jansel in #2391
  • [Autotuner] Confirm suspicious subprocess timings by @choijon5 in #2377
  • Attention Perf: Multiply Q in-loop to avoid memory spillage by @AmesingFlank in #2373
  • Attention Perf: Transpose blocked K right before QK instead of pre-transposing before the kernel by @AmesingFlank in #2374
  • [compiler][autotuner] Autotuner heuristics by @ethche in #2392
  • [compile] Remove unused device_load_count by @hinriksnaer in #2395
  • [compile] Eliminate two-phase initialization of HostFunction by @hinriksnaer in #2396
  • Avoid dynamic shape recompiles for 0/1 tensor dimensions by @oulgen in #2353
  • Add Claude Code workflow by @choijon5 in #2412
  • [XPU] Disable torch.compile fusion to unblock Inductor range-symbol failures by @karthickai in #2413
  • Skip unsupported fbcode and MTIA RNG tests by @choijon5 in #2414
  • [cutedsl] gate tcgen05_ab_stages=3 search behind per-CTA SMEM budget by @jansel in #2400
  • [cutedsl] add autotune sweep harness for CuTe examples by @jansel in #2401
  • [cutedsl] fix universal-MMA lane-loop and grid codegen guards by @jansel in #2402
  • [autotuner] make initial-population benchmark phase budget-aware by @jansel in #2403
  • [cutedsl] predicate atomic ops on CTA-resident ghost axes by @jansel in #2404
  • [cutedsl] widen c_input_warps and loop_orders autotune surface by @jansel in #2405
  • [cutedsl] discover aux-tensor descriptors at tcgen05 MMA codegen by @jansel in #2406
  • [cutedsl] emit c-input warp role-local while for residual scheduling by @jansel in #2407
  • [cutedsl] allocate c-input warp aux pipeline without producer barrier ops by @jansel in #2408
  • [cutedsl] fuse c-input warp aux prefetch into smem ring by @jansel in #2409
  • [cutedsl] widen autotune surface for c_input warp on residual kernels by @jansel in #2410
  • [cutedsl] fix c_input warp aux tile coords for cluster + l2_groupings by @jansel in #2411
  • RoPE kernel by @ethche in #2415
  • Add RoPE to nightly GPU benchmarks by @choijon5 in #2419
  • Add Mamba2 and GDN to nightly GPU benchmarks by @choijon5 in #2424
  • [compile] Add DeviceFunction.resolved_block_size(block_id) helper by @hinriksnaer in #2418
  • Fix perf dashboard for newly added examples by @choijon5 in #2434
  • Add RemoteCacheBackend ABC for pluggable remote autotune caching by @fulvius31 in #2317
  • Add CuTe NVFP4 GEMV example by @oulgen in #2433
  • [test/cute] use env directly in _get_mma_k_loop_info for block size resolution by @hinriksnaer in #2444
  • Infer CuTe NVFP4 conversions from dtypes by @oulgen in #2437
  • torch_tpu: update pin to 157713848ac0a510eb3a057c550861d999d4ec93 by @cota in #2438
  • Fixing the stale global memory read for AMD GPUs. by @umechand-amd in #1845
  • Make Triton do_not_specialize opt-in by @choijon5 in #2426
  • [metal] eliminate trivial stride-1 multiplication in MSL codegen by @hinriksnaer in #2432
  • [metal] use array subscript syntax for MSL memory access by @hinriksnaer in #2441
  • [metal] format MSL signature with one parameter per line by @hinriksnaer in #2442
  • Fix B200 benchmark CI failures by @choijon5 in #2455
  • [Pallas] Add fused_linear_jsd and grpo_loss to TPU benchmark sweep by @norx1991 in #2421
  • [Pallas] Accept kernels_tpu='all' for full-coverage TPU bench by @norx1991 in #2459
  • Limit A10G CI pytest workers to two by @choijon5 in #2457
  • Enable cudagraph for running examples by @choijon5 in #2461
  • Skip pretuned kernel perf gating in fbcode by @choijon5 in #2463
  • Optimize nvfp4 CuTe perf paths by @oulgen in #2462
  • ci: declare workflow-level contents: read on 3 workflows by @arpitjain099 in #2460
  • [Autotuner] LLM search: effort_level knob + Anthropic adaptive thinking + OpenAI xhigh by @choijon5 in #2446
  • [Autotuner] LLM search: Anthropic Opus 4.6/4.7 fast mode by @choijon5 in #2450
  • [Autotuner] LLM search: fail loudly + mTLS gateway compatibility by @choijon5 in #2448
  • [Examples] rope: print benchmark table via run_example by @choijon5 in #2451
  • [Autotuner] LLM prompt: diversify num_stages/num_warps in seed batch by @choijon5 in #2465
  • Add H100 (sm90) pretuned heuristics and perf gates by @choijon5 in #2454
  • Add Cute benchmark by @oulgen in #2466
  • [cache] Wire from_best_available / from_cache to RemoteCacheBackend by @fulvius31 in #2453
  • [Pallas] Add epilogue_subtiling to TPU benchmark sweep by @norx1991 in #2458
  • Fix rope-bwd cudagraph crash and tighten gdn_fwd_h accuracy gate by @choijon5 in #2467
  • [Pallas] Slice store values to match clamped Pallas BlockSpec ref shape by @thcmbs in #2398
  • PR Summary: Support reductions under branch-by-grid control flow by @yushangdi in #2480
  • Restore TF32 backend state by @yushangdi in #2482
  • [Pallas] Add kl_div to TPU benchmark sweep and dashboard by @norx1991 in #2484
  • Fix nightly perf CI by @choijon5 in #2476
  • torch_tpu: update pin to 3fb6cdbd96180e69df2233db51089656b230e6b6 by @cota in #2474
  • [docs] Document remote autotune cache and warm-start behavior by @fulvius31 in #2475
  • [CI fix] Raise benchmark subprocess timeout to 90s in fbcode by @choijon5 in #2487
  • Factor-out a bit of common logic for finding return names of if and else branches by @AmesingFlank in #2486
  • [Pallas] Add FP8 dtype mappings to torch-to-JAX table by @thcmbs in #2489
  • [Pallas] sympy mod printer by @thcmbs in #2490
  • Fix keepdim scalar reduction reshape in Triton codegen by @yushangdi in #2483
  • [Pallas] Add test for fused_linear_jsd_fwd autograd path by @norx1991 in #2456
  • Only include common outputs as outputs of traced if subgraph by @AmesingFlank in #2485
  • [dashboard] Suppress 'No Result' for kernels removed from workflow defaults by @choijon5 in #2492
  • [Pallas] Add a helion setting for pallas interpret mode by @AmesingFlank in #2522
  • [Pallas] Render outer_prefix for emit_pipeline and fori_loop scopes by @norx1991 in #2496
  • [Pallas] Thread pallas_interpret through runtime launchers by @AmesingFlank in #2524
  • [Pallas] Disable factory padding and preserve concrete dims by @thcmbs in #2477
  • [Pallas] Support tile.index broadcast indexing in load codegen by @norx1991 in #2532
  • [Pallas] Make Pallas interpret mode honor TPU constraints by @norx1991 in #2525
  • [Pallas] Route meta output-only tensors to CPU under interpret by @norx1991 in #2526

New Contributors

Full Changelog: v1.0.0...v1.1.0