What's Changed
- [Pallas] Use TPU-aware synchronize_device across autotuner and testing by @norx1991 in #1944
- Refactor if-else branches, unblock test_if_new_variable_in_static_range for Pallas TPU by @AmesingFlank in #1935
- [Pallas] Register pallas_loop_type only when inner loops exist by @norx1991 in #1915
- [Pallas] Fix emit_pipeline program_id mapping with loop_order reordering by @norx1991 in #1916
- [Pallas] Add expected-failure test for non-power-of-2 RDIM size by @norx1991 in #1945
- Update docs: add missing API entries by @choijon5 in #1948
- [Benchmark] Add compile time measurement to CI benchmarks by @choijon5 in #1952
- use is_symm_mem_tensor if the API is available by @shunting314 in #1933
- [Pallas] When not doing tiling for entire kernel, use explicit BlockSpecs instead of None by @AmesingFlank in #1960
- Fix FROM_BEST_AVAILABLE matching with hl.specialize() after #1883 by @fulvius31 in #1940
- [Pallas] Fix indexing scalars using SMEM memory space by @AmesingFlank in #1955
- [Pallas] Add xfail tests for scalar .begin index not collapsing dims by @norx1991 in #1971
- Skip test_hl_rand_mixed_argument_order on MTIA due to unaligned address crash by @karthickai in #1977
- Fix _supports_maxnreg() to guard against non-CUDA backends by @karthickai in #1981
- Skip TestRandomPhiloxParity class on MTIA (#1979) by @karthickai in #1982
- [Pallas] Fix accessing tensors using index from hl.grid() by @AmesingFlank in #1956
- [Pallas] Support using traced size-1 tensor as condition predicate, unblocking test_if_arg_indexed_scalar by @AmesingFlank in #1957
- [Pallas] Fix atomic_add dtype cast and VMEM preload for fori/pipeline launchers by @thcmbs in #1966
- [Pallas] Use exact RDIM size instead of next-power-of-2 by @norx1991 in #1954
- [Pallas] Add support for accessing tensors with the pattern of tile.index + offset /.id/.begin/.end by @AmesingFlank in #1968
- [metal] Reuse Inductor's MetalOverrides for MSL expression emission by @aditvenk in #1853
- [metal] Add Metal codegen handlers for load, store, and mask_to by @aditvenk in #1854
- [Pallas] Enable a subset of test_grid tests for Pallas by @AmesingFlank in #1985
- [Pallas] Add a test for accessing tensors with hl.grid() index + offset by @AmesingFlank in #1988
- Relax rms_norm example tolerance for Pallas bf16 by @thcmbs in #1983
- [autotuner] Introduce BenchmarkProvider abstraction for kernel benchmarking by @hinriksnaer in #1928
- [Pallas] Use HBM BlockSpecs for output-only tensors to save VMEM by @norx1991 in #1984
- Emit autotune failure summary warnings by @allgather in #1994
- [Pallas] Remove unused is_device_loop variable in _pallas_index_str by @AmesingFlank in #1990
- [Pallas] Fix BlockSpecs for 2D tl.grid([m, n]), unblocking test_scalar_access_hl_grid_2d by @AmesingFlank in #1986
- Reduce some autotuner overhead without changing kernel behavior by @svdrecbd in #1885
- [compile] Add `pre_codegen` hook to Backend ABC by @hinriksnaer in #1976
- Adding AMD Mi350x machines to CI with new labels. by @umechand-amd in #1835
- [Pallas] Fix accessing tensor via hl.grid() index within a device loop, unblocking test_scalar_access_hl_grid_2d_nested by @AmesingFlank in #1989
- Support mtia in LocalAutotuneCache by @Hamlin-Li in #1996
- removed stale comments by @hinriksnaer in #1997
- Fix invalid default config for kernels with large tensor numel by @fulvius31 in #1839
- chore: Bump actions/github-script from 8 to 9 by @dependabot[bot] in #2000
- Fix _benchmark dropping configs that fail compilation by @fulvius31 in #1942
- [metal] Add MSL AST walker for Python-to-C++ translation by @aditvenk in #1794
- Enable tensor_descriptor based atomic ops by @ethche in #1953
- [metal] Add @metal_jit decorator for AST-to-MSL compilation by @aditvenk in #1991
- Epilogue subtiling: store indexing fix, example, and tuple output support in run_example by @choijon5 in #1907
- [Compiler] Added backend registry by @hinriksnaer in #1967
- [metal] Wire @metal_jit into MetalBackend and simplify launcher by @aditvenk in #1992
- [Pallas] Skip trivial reduction mask when RDIM size equals actual dim by @norx1991 in #1993
- [Pallas] Fix TPU min_dot_size for matmul autotuning by @norx1991 in #1999
- chore: Bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #2015
- Make _get_tile_with_offset_info accept a torch.fx.Node as arg, instead of the entire CodegenState by @AmesingFlank in #2005
- [metal] Skip tests better to pass in internal CI by @aditvenk in #2016
- Add to_code() as backend-agnostic alias for to_triton_code() by @norx1991 in #2012
- [Pallas] Fix fori_loop multi-dim index decomposition with nested loops by @norx1991 in #1917
- custom config filter by @shunting314 in #1847
- [Pallas] Reject 64-bit input tensors and fix tiling ZeroDivisionError by @norx1991 in #1950
- Bump CI libtpu to 0.0.40 by @AmesingFlank in #2017
- [Pallas] Make test_tensor_access_tile_index_offset more meaningful and reveal codegen issue in Pallas backend by @AmesingFlank in #2006
- [Pallas] [NFC] Dedup _jnp_dtype_map helper by @thcmbs in #2021
- [Pallas] Add a `plan_tiling` pre_codegen pass to make more consistent tiling and indexing decisions by @AmesingFlank in #2007
- [Compiler] Add `reserved_launch_param_names` to Backend ABC by @hinriksnaer in #1970
- [Pallas] Remove non-tiled fallback paths which are no longer used after #2007 by @AmesingFlank in #2026
- [TPU][Pallas] Add lower bound analytical VMEM estimation and OOM guard for Pallas launchers by @yarongmu-google in #2024
- [Pallas] Fix codegen for slice indexing when there are squeezed dimensions by @AmesingFlank in #2027
- [CI] Pin Triton-to-tile-IR to last known-good commit by @norx1991 in #2028
- [Pallas] Add non-DMA fori_loop fallback for DMA-unaligned inner blocks by @thcmbs in #1969
- Skip `test_matmul_smaller_than_min_dot_size` on MTIA (#2030) by @will-cromar in #2030
- docs: add missing autofunction directives for 9 API functions by @Bodlux in #2036
- [AMD ROCM] matrix_instr_nonkdim restricted to 16 by @umechand-amd in #2032
- [Autotuner] Move benchmarking implementation into BenchmarkProvider by @hinriksnaer in #2029
- [Bug][Autotuner] Autotuner failed to clone mutated input arg by @xiaohongchen1991 in #2042
- [TPU][Pallas] Fix OOM on large reductions by delegating chunking to Mosaic by @yarongmu-google in #2033
- Remove ref baseline kernel count for test_clone_with_multiple_views_one_mutated as it depends on PyTorch version by @choijon5 in #2037
- docs: remove remaining mentions of HELION_USE_AUTOTUNE by @cota in #2038
- Fix test broken by #2033: use env.backend_name by @norx1991 in #2046
- Simplify CI: update PyTorch releases to 2.11, use bundled Triton by @choijon5 in #2047
- [Pallas] Under interpret mode, use float16 as HALF_DTYPE because bfloat16 is not supported on CPU by @AmesingFlank in #2050
- [AMD ROCm] Use AMD Triton backend for min_dot_size instead of NVIDIA by @choijon5 in #2048
- [Pallas] Adjust block size constraints by analyzing subscript exprs on tensors, unblock rms_norm_bwd example and a few tests by @AmesingFlank in #2051
- ND jagged tile support by @nullplay in #2052
- New performance dashboard with GitHub Pages deployment by @choijon5 in #2053
- [Pallas] Fix test_squeeze_slice_access to use code_and_output by @norx1991 in #2057
- Add Metal test job to CI test matrix by @aditvenk in #1862
- [Pallas] Add xfail tests for bmm non-divisible reduction by @norx1991 in #2031
- [Autotuner] Adding LLM-guided search by @choijon5 in #2003
- [cutedsl] Refactor reductions to use helper methods by @jansel in #2008
- [cutedsl] Strengthen layout planning pass invariants by @jansel in #2009
- [cutedsl] Improve dot with epilogue handling by @jansel in #2014
- [cutedsl] Plan grouped-N matmuls and lower atomic tensor indices by @jansel in #2020
- [Pallas] Exclude output-only tensors from Pallas pallas_call inputs to improve performance by @norx1991 in #1849
- [Pallas] Use FakeTensorMode to avoid HBM allocation for output-only tensors by @norx1991 in #2022
- Add simplified se_block kernel (#989) by @mengluy0125 in #989
- [Pallas] Fix symbolic offset codegen in TileIndexWithOffsetPattern by @norx1991 in #2068
- [Pallas] Replace FakeTensorMode wrap with device='meta' for output-only tensors by @norx1991 in #2071
- [TPU][Pallas] Enable TPU support and fix benchmarking for AOT compilation example by @yarongmu-google in #2059
- fix misleading benchmarking for fp8 gemm by @shunting314 in #1980
- add flashinfer allreduce-rmsnorm kernel by @shunting314 in #2063
- [Autotuner] Make the autotuner robust to `InvalidConfig` by @bringlein in #2039
- Deploy new perf dashboard to GitHub Pages by @choijon5 in #2066
- Add backend-agnostic lane loop APIs to tile strategies by @aditvenk in #1798
- [Pallas] Process tensor access within external lambdas when adjusting block size constraints by @AmesingFlank in #2073
- [Pallas] Fix emit_pipeline/fori_loop codegen when multiple inner loops tile the same dim by @norx1991 in #2075
- use non-interleaved benchmarking for all-reduce-rmsnorm by @shunting314 in #2065
- Unify dashboard deployment with docs deploy by @choijon5 in #2082
- [Pallas] Refactor memory space tracking into PallasMemorySpace enum by @norx1991 in #2072
- Dashboard: restrict Overview/Speedup to main branch, track latency, UI polish by @choijon5 in #2084
- [Pallas] Treat tile.id subscripts as untileable scalar indices by @norx1991 in #2083
- [Pallas] Validate pallas_loop_type by @thcmbs in #2055
- [NFC] [Pallas] Move indexing codegen helpers to pallas/codegen.py by @thcmbs in #2067
- [Autotuner] Reland LLM-seeded hybrid search (originally #2004) by @choijon5 in #2091
- Dashboard: fix crash-masking in CI status, split failures (accuracy/run/infra), chart polish by @choijon5 in #2086
- [Pallas] More robust analysis of tensors reads/writes via FX graph instead of AST, allowing more aggressive output_only optimizations by @AmesingFlank in #2088
- [Autotuner] Skip `temperature` for claude-opus-4-7 (HTTP 400) by @choijon5 in #2089
- [Autotuner] Raise LLM response token budget to fit verbose configs by @choijon5 in #2090
- Dashboard: dedupe MI325X duplicate entries, rename Runner Failures → No Result with last-seen date by @choijon5 in #2094
- Dashboard: include cancelled runs so partial artifacts aren't lost to 6h timeout by @choijon5 in #2095
- Fix hardcoded CUDA device in jagged_dense_bmm example by @norx1991 in #2080
- [CI] fix 13.2 TileIR ci pipeline by @qelk123 in #2076
- allow reuse variables across different static loops by @shunting314 in #2081
- [Pallas] Fix SMEM/VMEM conflict for tensors with mixed access patterns by @norx1991 in #2069
- Removing MI325 runners from CI by @umechand-amd in #2096
- Benchmark: Reduce running Tritonbench for each kernel from twice to once. by @choijon5 in #2097
- [ROCm] Improve ROCm compatibility for distributed kernels and expand ROCm test coverage in distributed test suites. by @umechand-amd in #2049
- [dashboard] temporarily remove --existing-url to rebuild cache by @choijon5 in #2109
- Pin tritonbench commit in a file. by @umechand-amd in #2106
- [dashboard] re-enable --existing-url and filter platforms from dispatch workflow by @choijon5 in #2110
- Fix flaky distributed and torch.compile tests by @choijon5 in #2098
- [Benchmarking/CI] Prevent hangs in benchmark phase via subprocess + per-config run timeout by @choijon5 in #2111
- [Autotuner] Seed LFBO surrogate with stage-1 LLM benchmarks in hybrid search by @choijon5 in #2113
- [dashboard] Fix empty dashboard caused by broken GitHub API query filters by @choijon5 in #2118
- [Pallas] Host-side padding for non-divisible pl.ds() dimensions by @norx1991 in #2104
- Update tritonbench.txt by @umechand-amd in #2121
- Fix flaky test due to spurious NaN in fusion autotune accuracy check by @choijon5 in #2115
- [Autotuner] Skip subprocess sticky CUDA errors instead of aborting autotune by @choijon5 in #2122
- [Pallas] Adding a tunable pre_broadcast optimization pass for TPU scratch buffers, improving TPU attention perf by @AmesingFlank in #2103
- [dashboard] Trigger docs-deploy explicitly from benchmark dispatch by @choijon5 in #2129
- [Pallas] Fix multi-dim padding overwriting original tensor reference by @norx1991 in #2120
- [Pallas] Extend padding to fori_loop DMA and emit_pipeline via _record_pad_info by @norx1991 in #2105
- [Autotuner] Handle PyTorch CUDA OOM as a skippable error instead of aborting autotuning by @ethche in #2130
- [Pallas] Indirect gather with pluggable strategies by @thcmbs in #2054
- [lint] upgrade to pyrefly 0.63.1 by @oulgen in #2134
- [lint] upgrade ruff to 0.15.12 by @oulgen in #2135
- [Pallas] Add pl.multiple_of alignment hint to pl.ds() offsets by @norx1991 in #2116
- Reduce measure() overhead when compile-time tracking is disabled by @gmagogsfm in #2139
- [CI] Enable build and test for XPU by @chuanqi129 in #1327
- [Autotuner] Catch expected errors during fork precompiler setup instead of aborting by @ethche in #2142
- Optimize cache key computation overhead by @gmagogsfm in #2144
- [Pallas] Honor _smem_arg_indices in pipeline launchers by @norx1991 in #2143
- Update the tritonbench commit to its current ToT, where the PR to fix the segfaults has landed. by @umechand-amd in #2149
- [Benchmarking] Disable cudagraph for layer_norm-bwd / rms_norm-bwd by @choijon5 in #2127
- [docs] Add dashboard link, LLM autotuner docs, remove past events by @choijon5 in #2151
- [Pallas] Per-dim VMEM accounting for gather budget check by @thcmbs in #2137
- [cute] Enable TestControlFlow by @oulgen in #2136
- [cute] Add codegen for hl.split, hl.join, and aten.view.dtype; enable test_views.py by @oulgen in #2138
- [Pallas] Rename pallas_loop_type "default" to "unroll" by @norx1991 in #2155
- Remove redundant `m_i` update in example attention kernel by @AmesingFlank in #2156
- Use integer arithmetic instead of triton.cdiv in launcher by @gmagogsfm in #2146
- unbreak docs build by @oulgen in #2162
- [Pallas] Use torch.addmm in matmul_layernorm K-loop by @norx1991 in #2141
- [cute] Enable bunch of test suites by @oulgen in #2159
- Minor runpod updates by @jansel in #2163
- Update AGENTS.md by @jansel in #2164
- Add cute-verify skill by @jansel in #2165
- Add scripts/autoreview.py by @jansel in #2166
- Run codespell from ./lint.sh by @jansel in #2175
- [cutedsl] Matmul performance prework by @jansel in #2167
- Small attention optimization: pre-scale q tile with qk_scale by @AmesingFlank in #2157
- [Pallas] Tighten _check_dma_alignment + make "unroll" tests explicit by @norx1991 in #2158
- [cute] Bump minimum cute version to 4.5 by @oulgen in #2180
- Fix compile time measurements by @choijon5 in #2188
- torch_tpu: update pin to 28d941aec27 by @cota in #1895
- Remove redundant compile time env by @choijon5 in #2191
- [Pallas] Emit offset/indices at inner-loop body prologue by @norx1991 in #2181
- [Autotuner] Fix crash when autotuner_min exceeds max_size by @stmcgovern in #2177
- Fix attention benchmark accuracy by @choijon5 in #2178
- [Pallas] Enable se_block tests on TPU + simplify skipIfCudaCapabilityLessThan by @norx1991 in #2131
- [cute] Implement topk and sort by @oulgen in #2160
- Fix negative shift by @oulgen in #2185
- [flaky test] skip register cache test on XPU by @choijon5 in #2192
- Add fix-pr skill by @jansel in #2189
- Add offsets kwarg to hl.rand for explicit Philox offsets by @karthickai in #2153
- [Pallas] Per-tensor pipelining decision in fori_loop and emit_pipeline by @norx1991 in #2093
- [cute] Implement associative scan by @oulgen in #2161
- Add `helion.from_cache()` for FiniteSearch warm-start by @fulvius31 in #2079
- [cute] Enable test_print by @oulgen in #2186
- [BoundKernel] added _normalize_config by @hinriksnaer in #2152
- [cutedsl] Improve CuTe tcgen05 matmul autotuning and direct-store epilogues by @jansel in #2168
- [cutedsl] Compile CuTe launchers once and harden regression coverage by @jansel in #2169
- [cutedsl] track tcgen05 per-tile setup and register split by @jansel in #2170
- [cutedsl] add autotune wall-time budget by @jansel in #2171
- [cutedsl] split tcgen05 persistent post-loop cleanup by @jansel in #2172
- [cutedsl] prune dead tcgen05 role scaffolding by @jansel in #2173
- [cutedsl] simplify tcgen05 layout plan by @jansel in #2174
- [cutedsl] guard tcgen05 persistent multi-tile at runtime by @jansel in #2193
- Fix flash attention benchmark CI by @choijon5 in #2207
- [cute] enable test unroll tuples by @oulgen in #2187
- [HostFunction] Extract `_parse_source` from `Hostfunction.__init__` by @hinriksnaer in #2154
- Dashboard: latency-as-default-graph, noise muting, platform sync, color fixes by @choijon5 in #2208
- [cutedsl] split tcgen05 persistent setup into layout + prelude + tile body by @jansel in #2194
- [cutedsl] add tcgen05 persistent role-block scaffolding + TMA-load tagging by @jansel in #2195
- [cutedsl] split tcgen05 per-K-iter TMA producer/consumer block by @jansel in #2196
- [cutedsl] recurse partitioner into K-loop body for tcgen05 TMA producer by @jansel in #2197
- [cutedsl] split tcgen05 per-K-iter TMA builders into named helpers by @jansel in #2198
- [cutedsl] split tcgen05 initial-prefetch IF emission into AST helper by @jansel in #2199
- [cutedsl] dedupe tcgen05 codegen test mocks via _testing helpers by @jansel in #2200
- [cutedsl] extract tcgen05 multi-tile guard var/message into class constants by @jansel in #2201
- [cutedsl] consolidate cute reduction branches and tcgen05 autotune narrowing by @jansel in #2202
- [cutedsl] extract _count_rdim_axes_in_val helper in roll_reduction by @jansel in #2203
- [cutedsl] narrow tcgen05_num_epi_warps autotune to (4,) to avoid wrong output by @jansel in #2204
- [cutedsl] reject tcgen05_num_epi_warps != 4 at codegen + diagnose root cause by @jansel in #2205
- [cutedsl] add tcgen05 role-local-while builder infrastructure (3b-prep-4) by @jansel in #2206
- Add missing onlyBackends([cute]) by @jansel in #2219
- [cute] Enable test_indexing by @oulgen in #2210
- [cute] Add basic autotuning capabilities by @oulgen in #2221
- [Pallas] Use exprs from AST instead of SymPy exprs when generating loop bounds by @AmesingFlank in #2211
- [Pallas] When there are data-dependent loop bounds, also use fori_loop instead of unroll by @AmesingFlank in #2212
- [Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins by @AmesingFlank in #2213
- [Pallas] Apply tile masks at load time to zero out-of-bounds data by @AmesingFlank in #2214
- Temporarily disable XPU CI by @jansel in #2270
- Support HELION_AUTOTUNE_EFFORT=none HELION_FORCE_AUTOTUNE=1 by @oulgen in #2265
- Retry more flakey job types by @jansel in #2269
- [cutedsl] split persistent tma producer role by @jansel in #2224
- [cutedsl] split persistent mma exec role by @jansel in #2225
- [cutedsl] split persistent epi role by @jansel in #2226
- [cutedsl] validate tcgen05 persistent z-grid by @jansel in #2227
- [AOT] Suppress heuristic cache hit messages by default by @choijon5 in #2220
- [Pallas] Fix DMA scratch buffer offset bug in nested fori_loop codegen by @AmesingFlank in #2217
- [Pallas] Fix scratch Ref scoping bug in fori_loop/emit_pipeline codegen by @AmesingFlank in #2215
- [cutedsl] restore scoped persistent pid autotune by @jansel in #2228
- [cutedsl] Add flat tcgen05 TMA store epilogue by @jansel in #2229
- [cutedsl] Add persistent tcgen05 TMA store epilogue by @jansel in #2230
- [cutedsl] Close out tcgen05 TMA store acquire ordering by @jansel in #2231
- [cutedsl] add guarded CtaGroup.TWO structural codegen by @jansel in #2232
- [cutedsl] align two-CTA AB pipeline ownership by @jansel in #2233
- [cutedsl] align two-CTA TMEM setup ordering by @jansel in #2234
- [cutedsl] align two-CTA scheduler publication by @jansel in #2235
- [cutedsl] advance guarded two-CTA role-local codegen by @jansel in #2236
- [cutedsl] omit two-cta shared scheduler loop by @jansel in #2237
- [cutedsl] Guard tcgen05 omit-shared scalar setup by @jansel in #2238
- [cutedsl] Adjust two-CTA TMEM teardown by @jansel in #2239
- [cutedsl] split tma tail capability check by @jansel in #2240
- Fix tensor descriptor silent fallback for scalar SymInt subscripts by @ethche in #2222
- [cutedsl] defer two-cta pipeline constructor sync by @jansel in #2241
- [cutedsl] validate single-tile two-cta runtime by @jansel in #2242
- [cutedsl] admit non-recycling two-cta tiles by @jansel in #2243
- [cutedsl] validate shallow-k two-cta direct grid by @jansel in #2244
- [cutedsl] validate long-k two-cta direct grid by @jansel in #2245
- [cutedsl] enable CtaGroup.TWO TMA-store epilogue by @jansel in #2246
- [cutedsl] elide CtaGroup.TWO role schedulers by @jansel in #2247
- [cutedsl] restore two-cta scheduler recycling by @jansel in #2248
- Skip grid td xpu by @ethche in #2275
- [cutedsl] re-enable two-cta autotune search by @jansel in #2249
- [cutedsl] seed two-cta autotune search by @jansel in #2250
- [cutedsl] prune two-cta autotune failures by @jansel in #2251
- [cutedsl] seed two-cta l2 grouping by @jansel in #2252
- [cutedsl] seed two-cta tensor indexing by @jansel in #2253
- [cutedsl] trim tcgen05 epilogue barrier by @jansel in #2255
- [cutedsl] add two-cta pdl markers by @jansel in #2258
- Temporarily disable pallas CI until upstream torch_tpu is fixed by @jansel in #2281
- TPU CI: Use PyTorch nightly from 20260502 instead of most recent nightly to unblock CI by @AmesingFlank in #2298
- [Pallas] Cast bool masks to float before expanding in _mask_to codegen by @AmesingFlank in #2216
- [Pallas] Skip fp32 fallback for unary transcendentals on TPU by @norx1991 in #2268
- [Pallas] Add xfail tests for BMM with non-zero K begin by @norx1991 in #2271
- [Pallas] Fix pre-broadcasting transformation bug when non-broadcast dims exceed PRE_BROADCAST_SIZE by @AmesingFlank in #2223
- [Pallas] Lower hl.zeros / hl.full to plain jnp.full by @norx1991 in #2278
- [Pallas] Fix failing scratch shapes asserts due to land-time race when #2278 caused scratch shapes to be re-ordered by @AmesingFlank in #2302
- [language] Add hl.rand4x for 4-output Philox RNG by @karthickai in #2283
- Fix failing cutlass lints by @AmesingFlank in #2303
- [xpu] Disable proton build for XPU by @Stonepia in #2300
- [Pallas] Use dot_general instead of matmul for Pallas codegen by @AmesingFlank in #2299
- [Autotuner] Enable autotuner seed configs by @ethche in #2276
- update torch_tpu pin to a1ef0dd7fa2ffb730995e31953d1b5d316226c96 by @cota in #2316
- TPU CI: Restore to using latest nightly pytorch by @AmesingFlank in #2320
- [Pallas] Use jax_export_ignore_forward_compatibility=True when exporting JaxCallable, improving attention perf by @AmesingFlank in #2323
- [Pallas] Make pallas_pre_broadcast a tunable autotune fragment by @norx1991 in #2324
- Fix cute CI failures by @jansel in #2325
- [cute] Pin to official 4.5.0 by @oulgen in #2326
- [Pallas] Integrate TPU benchmarks into Benchmark Dispatch + dashboard by @norx1991 in #1913
- remove obsolete _init_tpu_device helper by @thcmbs in #2319
- Enable cute for error tests by @oulgen in #2339
- [cutedsl] fix race in scalar store with full-slice subscript by @jansel in #2327
- [cutedsl] reorder and hoist tcgen05 C-store epilogue by @jansel in #2328
- [cutedsl] prefetch tcgen05 consumer token, reset UMMA accumulate per tile, seed pid order by @jansel in #2329
- [cutedsl] add tcgen05 C-store / acc-wait / skip-UMMA / cubin-lineinfo diagnostic knobs by @jansel in #2330
- [cutedsl] add tcgen05 split-first and store-tail T2R epilogue diagnostics by @jansel in #2331
- [cutedsl] add tcgen05 module-helper T2R epilogue diagnostics by @jansel in #2332
- [cutedsl] add tcgen05 role-local bridge codegen and tests by @jansel in #2333
- [cutedsl] add tcgen05 role-local bridge pipeline, TMA mask, and ownership by @jansel in #2334
- [cutedsl] add tcgen05 bridge AB acc-advance and acquire diagnostics by @jansel in #2335
- [cutedsl] add tcgen05 bridge AB wait, phase, and initial-acquire diagnostics by @jansel in #2336
- [cutedsl] add tcgen05 larger-BN codegen test by @jansel in #2337
- Enable cute for int64 indexing tests by @oulgen in #2340
- Enable cute for cache tests by @oulgen in #2341
- Enable cute for autotune tests by @oulgen in #2342
- Add reduction support to helion autodiff by @karthickai in #1747
- Dashboard: only nightly runs populate Overview; manual dispatches stay in Compare by @choijon5 in #2355
- [Pallas] Don't pipeline tensors read or written outside the inner loop by @norx1991 in #2284
- [tutorials] pretuned Helion examples by @choijon5 in #2209
- [docs] Add AOT autotuning documentation by @choijon5 in #2274
- Enable cute for stack tensor tests by @oulgen in #2349
- Enable cute for jagged tile tests by @oulgen in #2350
- Enable cute for epilogue subtiling tests by @oulgen in #2352
- [Pallas] Align TPU kernel names with GPU dashboard + add compile-time measurement by @norx1991 in #2354
- [runtime:pallas] consolidate pallas_aliases computation in _pallas_prepare_args by @cota in #2348
- [docs] Fix broken tutorials/ links in AOT autotuning doc by @norx1991 in #2358
- Enable tensor_descriptor for static_shapes = False by @ethche in #2356
- [compile] Introduce KernelCompiler for pipeline orchestration by @hinriksnaer in #2267
- [Pallas] Default pallas_loop_type to emit_pipeline by @norx1991 in #2321
- ci: bump TPU jax/jaxlib pin to 0.10.0 by @thcmbs in #2361
- Dashboard: fix nightly detection and restrict Overview to main branch by @choijon5 in #2363
- Dashboard: add geomean footer rows to Speedup and Compare tables by @choijon5 in #2364
- [compile] Add create_reduction_strategy() and adjust_reduction_thread_count to Backend by @hinriksnaer in #2318
- Dashboard: don't flag manual-only kernels as infra_missing by @norx1991 in #2366
- Support lowering `torch.bmm(..., dtype=)`, use it for attention to avoid redundant fp32 -> fp16 -> fp32 roundtrip by @AmesingFlank in #2365
- [cute] Generalize synthetic lane loops and loop-carried accumulator checks by @hinriksnaer in #2347
- [Pallas] Make per-tile element cap a backend hook, disable for Pallas by @norx1991 in #2282
- [Pallas] Add larger shapes for attention and matmul_layernorm benchmarks by @norx1991 in #2368
- Docs: merge AOT Autotuning into Deployment guide; list pretuned kernels under examples. by @choijon5 in #2369
- Relax RMS pretuned kernel perf wins gate by @choijon5 in #2371
- [Benchmarking] Default to benchmark subprocess by @choijon5 in #2372
- [Autotuner] Clear Triton JIT fast-path caches after benchmarks by @ethche in #2367
- Support dynamic TD guards for container tensors by @ethche in #2370
- Increase effort for autoreview by @jansel in #2375
- [Dashboard] Use paired geomeans in comparisons by @choijon5 in #2376
- update torch_tpu pin to 104763049fe1df6834605fed1cd2b79434ea02d5 by @cota in #2362
- Use HALF_DTYPE in epilogue subtiling example by @thcmbs in #2343
- Relax test_pretuned_kernels.py targets by @jansel in #2379
- [pallas] remove remaining torch_tpu.api usage by @cota in #2345
- rms_norm: save inv_rms in fp32 to fix bf16 backward by @thcmbs in #2273
- [cutedsl] Add Tcgen05Strategy/WarpSpec data model and config keys by @jansel in #2380
- [cutedsl] Drive matmul warp roles from generated warp-spec records by @jansel in #2381
- [cutedsl] Implement ROLE_LOCAL_WITH_SCHEDULER matmul strategy by @jansel in #2382
- [cutedsl] Pin output dtype on matmul plan for epilogue tile shape by @jansel in #2383
- [cutedsl] Add CLC-persistent tile scheduler for matmul by @jansel in #2384
- [cutedsl] Support cluster_n=2 in matmul lowering by @jansel in #2385
- [cutedsl] Add unary epilogue chain analyzer and splicing by @jansel in #2386
- [cutedsl] Fuse auxiliary tensor loads in epilogue chains by @jansel in #2387
- [cutedsl] Enable cluster_n=2 under role-local scheduler by @jansel in #2388
- [cutedsl] Gate cluster_m=2 search by wave quantization and broaden epi-fusion shapes by @jansel in #2389
- [cutedsl] Add A/B SMEM and L2 scheduler swizzle controls by @jansel in #2390
- Add _gelu_tanh_approx op by @jansel in #2391
- [Autotuner] Confirm suspicious subprocess timings by @choijon5 in #2377
- Attention Perf: Multiply Q in-loop to avoid memory spillage by @AmesingFlank in #2373
- Attention Perf: Transpose blocked K right before QK instead of pre-transposing before the kernel by @AmesingFlank in #2374
- [compiler][autotuner] Autotuner heuristics by @ethche in #2392
- [compile] Remove unused device_load_count by @hinriksnaer in #2395
- [compile] Eliminate two-phase initialization of HostFunction by @hinriksnaer in #2396
- Avoid dynamic shape recompiles for 0/1 tensor dimensions by @oulgen in #2353
- Add Claude Code workflow by @choijon5 in #2412
- [XPU] Disable torch.compile fusion to unblock Inductor range-symbol failures by @karthickai in #2413
- Skip unsupported fbcode and MTIA RNG tests by @choijon5 in #2414
- [cutedsl] gate tcgen05_ab_stages=3 search behind per-CTA SMEM budget by @jansel in #2400
- [cutedsl] add autotune sweep harness for CuTe examples by @jansel in #2401
- [cutedsl] fix universal-MMA lane-loop and grid codegen guards by @jansel in #2402
- [autotuner] make initial-population benchmark phase budget-aware by @jansel in #2403
- [cutedsl] predicate atomic ops on CTA-resident ghost axes by @jansel in #2404
- [cutedsl] widen c_input_warps and loop_orders autotune surface by @jansel in #2405
- [cutedsl] discover aux-tensor descriptors at tcgen05 MMA codegen by @jansel in #2406
- [cutedsl] emit c-input warp role-local while for residual scheduling by @jansel in #2407
- [cutedsl] allocate c-input warp aux pipeline without producer barrier ops by @jansel in #2408
- [cutedsl] fuse c-input warp aux prefetch into smem ring by @jansel in #2409
- [cutedsl] widen autotune surface for c_input warp on residual kernels by @jansel in #2410
- [cutedsl] fix c_input warp aux tile coords for cluster + l2_groupings by @jansel in #2411
- RoPE kernel by @ethche in #2415
- Add RoPE to nightly GPU benchmarks by @choijon5 in #2419
- Add Mamba2 and GDN to nightly GPU benchmarks by @choijon5 in #2424
- [compile] Add DeviceFunction.resolved_block_size(block_id) helper by @hinriksnaer in #2418
- Fix perf dashboard for newly added examples by @choijon5 in #2434
- Add RemoteCacheBackend ABC for pluggable remote autotune caching by @fulvius31 in #2317
- Add CuTe NVFP4 GEMV example by @oulgen in #2433
- [test/cute] use env directly in _get_mma_k_loop_info for block size resolution by @hinriksnaer in #2444
- Infer CuTe NVFP4 conversions from dtypes by @oulgen in #2437
- torch_tpu: update pin to 157713848ac0a510eb3a057c550861d999d4ec93 by @cota in #2438
- Fixing the stale global memory read for AMD GPUs. by @umechand-amd in #1845
- Make Triton do_not_specialize opt-in by @choijon5 in #2426
- [metal] eliminate trivial stride-1 multiplication in MSL codegen by @hinriksnaer in #2432
- [metal] use array subscript syntax for MSL memory access by @hinriksnaer in #2441
- [metal] format MSL signature with one parameter per line by @hinriksnaer in #2442
- Fix B200 benchmark CI failures by @choijon5 in #2455
- [Pallas] Add fused_linear_jsd and grpo_loss to TPU benchmark sweep by @norx1991 in #2421
- [Pallas] Accept kernels_tpu='all' for full-coverage TPU bench by @norx1991 in #2459
- Limit A10G CI pytest workers to two by @choijon5 in #2457
- Enable cudagraph for running examples by @choijon5 in #2461
- Skip pretuned kernel perf gating in fbcode by @choijon5 in #2463
- Optimize nvfp4 CuTe perf paths by @oulgen in #2462
- ci: declare workflow-level `contents: read` on 3 workflows by @arpitjain099 in #2460
- [Autotuner] LLM search: effort_level knob + Anthropic adaptive thinking + OpenAI xhigh by @choijon5 in #2446
- [Autotuner] LLM search: Anthropic Opus 4.6/4.7 fast mode by @choijon5 in #2450
- [Autotuner] LLM search: fail loudly + mTLS gateway compatibility by @choijon5 in #2448
- [Examples] rope: print benchmark table via run_example by @choijon5 in #2451
- [Autotuner] LLM prompt: diversify num_stages/num_warps in seed batch by @choijon5 in #2465
- Add H100 (sm90) pretuned heuristics and perf gates by @choijon5 in #2454
- Add Cute benchmark by @oulgen in #2466
- [cache] Wire from_best_available / from_cache to RemoteCacheBackend by @fulvius31 in #2453
- [Pallas] Add epilogue_subtiling to TPU benchmark sweep by @norx1991 in #2458
- Fix rope-bwd cudagraph crash and tighten gdn_fwd_h accuracy gate by @choijon5 in #2467
- [Pallas] Slice store values to match clamped Pallas BlockSpec ref shape by @thcmbs in #2398
- PR Summary: Support reductions under branch-by-grid control flow by @yushangdi in #2480
- Restore TF32 backend state by @yushangdi in #2482
- [Pallas] Add kl_div to TPU benchmark sweep and dashboard by @norx1991 in #2484
- Fix nightly perf CI by @choijon5 in #2476
- torch_tpu: update pin to 3fb6cdbd96180e69df2233db51089656b230e6b6 by @cota in #2474
- [docs] Document remote autotune cache and warm-start behavior by @fulvius31 in #2475
- [CI fix] Raise benchmark subprocess timeout to 90s in fbcode by @choijon5 in #2487
- Factor-out a bit of common logic for finding return names of if and else branches by @AmesingFlank in #2486
- [Pallas] Add FP8 dtype mappings to torch-to-JAX table by @thcmbs in #2489
- [Pallas] sympy mod printer by @thcmbs in #2490
- Fix keepdim scalar reduction reshape in Triton codegen by @yushangdi in #2483
- [Pallas] Add test for fused_linear_jsd_fwd autograd path by @norx1991 in #2456
- Only include common outputs as outputs of traced if subgraph by @AmesingFlank in #2485
- [dashboard] Suppress 'No Result' for kernels removed from workflow defaults by @choijon5 in #2492
- [Pallas] Add a helion setting for pallas interpret mode by @AmesingFlank in #2522
- [Pallas] Render outer_prefix for emit_pipeline and fori_loop scopes by @norx1991 in #2496
- [Pallas] Thread pallas_interpret through runtime launchers by @AmesingFlank in #2524
- [Pallas] Disable factory padding and preserve concrete dims by @thcmbs in #2477
- [Pallas] Support tile.index broadcast indexing in load codegen by @norx1991 in #2532
- [Pallas] Make Pallas interpret mode honor TPU constraints by @norx1991 in #2525
- [Pallas] Route meta output-only tensors to CPU under interpret by @norx1991 in #2526
New Contributors
- @allgather made their first contribution in #1994
- @svdrecbd made their first contribution in #1885
- @Hamlin-Li made their first contribution in #1996
- @yarongmu-google made their first contribution in #2024
- @will-cromar made their first contribution in #2030
- @Bodlux made their first contribution in #2036
- @xiaohongchen1991 made their first contribution in #2042
- @chuanqi129 made their first contribution in #1327
- @stmcgovern made their first contribution in #2177
- @arpitjain099 made their first contribution in #2460
Full Changelog: v1.0.0...v1.1.0