Release v1.1.0 · pytorch/helion

What's Changed

[Pallas] Use TPU-aware synchronize_device across autotuner and testing by @norx1991 in #1944
Refactor if-else branches, unblock test_if_new_variable_in_static_range for Pallas TPU by @AmesingFlank in #1935
[Pallas] Register pallas_loop_type only when inner loops exist by @norx1991 in #1915
[Pallas] Fix emit_pipeline program_id mapping with loop_order reordering by @norx1991 in #1916
[Pallas] Add expected-failure test for non-power-of-2 RDIM size by @norx1991 in #1945
Update docs: add missing API entries by @choijon5 in #1948
[Benchmark] Add compile time measurement to CI benchmarks by @choijon5 in #1952
use is_symm_mem_tensor if the API is available by @shunting314 in #1933
[Pallas] When not doing tiling for entire kernel, use explicit BlockSpecs instead of None by @AmesingFlank in #1960
Fix FROM_BEST_AVAILABLE matching with hl.specialize() after #1883 by @fulvius31 in #1940
[Pallas] Fix indexing scalars using SMEM memory space by @AmesingFlank in #1955
[Pallas] Add xfail tests for scalar .begin index not collapsing dims by @norx1991 in #1971
Skip test_hl_rand_mixed_argument_order on MTIA due to unaligned address crash by @karthickai in #1977
Fix _supports_maxnreg() to guard against non-CUDA backends by @karthickai in #1981
Skip TestRandomPhiloxParity class on MTIA (#1979) by @karthickai in #1982
[Pallas] Fix accessing tensors using index from hl.grid() by @AmesingFlank in #1956
[Pallas] Support using traced size-1 tensor as condition predicate, unblocking test_if_arg_indexed_scalar by @AmesingFlank in #1957
[Pallas] Fix atomic_add dtype cast and VMEM preload for fori/pipeline launchers by @thcmbs in #1966
[Pallas] Use exact RDIM size instead of next-power-of-2 by @norx1991 in #1954
[Pallas] Add support for accessing tensors with the pattern of tile.index + offset /.id/.begin/.end by @AmesingFlank in #1968
[metal] Reuse Inductor's MetalOverrides for MSL expression emission by @aditvenk in #1853
[metal] Add Metal codegen handlers for load, store, and mask_to by @aditvenk in #1854
[Pallas] Enable a subset of test_grid tests for Pallas by @AmesingFlank in #1985
[Pallas] Add a test for accessing tensors with hl.grid() index + offset by @AmesingFlank in #1988
Relax rms_norm example tolerance for Pallas bf16 by @thcmbs in #1983
[autotuner] Introduce BenchmarkProvider abstraction for kernel benchmarking by @hinriksnaer in #1928
[Pallas] Use HBM BlockSpecs for output-only tensors to save VMEM by @norx1991 in #1984
Emit autotune failure summary warnings by @allgather in #1994
[Pallas] Remove unused is_device_loop variable in _pallas_index_str by @AmesingFlank in #1990
[Pallas] Fix BlockSpecs for 2D tl.grid([m, n]), unblocking test_scalar_access_hl_grid_2d by @AmesingFlank in #1986
Reduce some autotuner overhead without changing kernel behavior by @svdrecbd in #1885
[compile] Add pre_codegen hook to Backend ABC by @hinriksnaer in #1976
Adding AMD Mi350x machines to CI with new labels. by @umechand-amd in #1835
[Pallas] Fix accessing tensor via hl.grid() index within a device loop, unblocking test_scalar_access_hl_grid_2d_nested by @AmesingFlank in #1989
Support mtia in LocalAutotuneCache by @Hamlin-Li in #1996
removed stale comments by @hinriksnaer in #1997
Fix invalid default config for kernels with large tensor numel by @fulvius31 in #1839
chore: Bump actions/github-script from 8 to 9 by @dependabot[bot] in #2000
Fix _benchmark dropping configs that fail compilation by @fulvius31 in #1942
[metal] Add MSL AST walker for Python-to-C++ translation by @aditvenk in #1794
Enable tensor_descriptor based atomic ops by @ethche in #1953
[metal] Add @metal_jit decorator for AST-to-MSL compilation by @aditvenk in #1991
Epilogue subtiling: store indexing fix, example, and tuple output support in run_example by @choijon5 in #1907
[Compiler] Added backend registry by @hinriksnaer in #1967
[metal] Wire @metal_jit into MetalBackend and simplify launcher by @aditvenk in #1992
[Pallas] Skip trivial reduction mask when RDIM size equals actual dim by @norx1991 in #1993
[Pallas] Fix TPU min_dot_size for matmul autotuning by @norx1991 in #1999
chore: Bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #2015
Make _get_tile_with_offset_info accept a torch.fx.Node as arg, instead of the entire CodegenState by @AmesingFlank in #2005
[metal] Skip tests better to pass in internal CI by @aditvenk in #2016
Add to_code() as backend-agnostic alias for to_triton_code() by @norx1991 in #2012
[Pallas] Fix fori_loop multi-dim index decomposition with nested loops by @norx1991 in #1917
custom config filter by @shunting314 in #1847
[Pallas] Reject 64-bit input tensors and fix tiling ZeroDivisionError by @norx1991 in #1950
Bump CI libtpu to 0.0.40 by @AmesingFlank in #2017
[Pallas] Make test_tensor_access_tile_index_offset more meaningful and reveal codegen issue in Pallas backend by @AmesingFlank in #2006
[Pallas] [NFC] Dedup _jnp_dtype_map helper by @thcmbs in #2021
[Pallas] Add a plan_tiling pre_codegen pass to make more consistent tiling and indexing decisions by @AmesingFlank in #2007
[Compiler] Add reserved_launch_param_names to Backend ABC by @hinriksnaer in #1970
[Pallas] Remove non-tiled fallback paths which are no longer used after #2007 by @AmesingFlank in #2026
[TPU][Pallas] Add lower bound analytical VMEM estimation and OOM guard for Pallas launchers by @yarongmu-google in #2024
[Pallas] Fix codegen for slice indexing when there are squeezed dimensions by @AmesingFlank in #2027
[CI] Pin Triton-to-tile-IR to last known-good commit by @norx1991 in #2028
[Pallas] Add non-DMA fori_loop fallback for DMA-unaligned inner blocks by @thcmbs in #1969
Skip test_matmul_smaller_than_min_dot_size on MTIA (#2030) by @will-cromar in #2030
docs: add missing autofunction directives for 9 API functions by @Bodlux in #2036
[AMD ROCM] matrix_instr_nonkdim restricted to 16 by @umechand-amd in #2032
[Autotuner] Move benchmarking implementation into BenchmarkProvider by @hinriksnaer in #2029
[Bug][Autotuner] Autotuner failed to clone mutated input arg by @xiaohongchen1991 in #2042
[TPU][Pallas] Fix OOM on large reductions by delegating chunking to Mosaic by @yarongmu-google in #2033
Remove ref baseline kernel count for test_clone_with_multiple_views_one_mutated as it depends on PyTorch version by @choijon5 in #2037
docs: remove remaining mentions of HELION_USE_AUTOTUNE by @cota in #2038
Fix test broken by #2033: use env.backend_name by @norx1991 in #2046
Simplify CI: update PyTorch releases to 2.11, use bundled Triton by @choijon5 in #2047
[Pallas] Under interpret mode, use float16 as HALF_DTYPE because bfloat16 is not supported on CPU by @AmesingFlank in #2050
[AMD ROCm] Use AMD Triton backend for min_dot_size instead of NVIDIA by @choijon5 in #2048
[Pallas] Adjust block size constraints by analyzing subscript exprs on tensors, unblock rms_norm_bwd example and a few tests by @AmesingFlank in #2051
ND jagged tile support by @nullplay in #2052
New performance dashboard with GitHub Pages deployment by @choijon5 in #2053
[Pallas] Fix test_squeeze_slice_access to use code_and_output by @norx1991 in #2057
Add Metal test job to CI test matrix by @aditvenk in #1862
[Pallas] Add xfail tests for bmm non-divisible reduction by @norx1991 in #2031
[Autotuner] Adding LLM-guided search by @choijon5 in #2003
[cutedsl] Refactor reductions to use helper methods by @jansel in #2008
[cutedsl] Strengthen layout planning pass invariants by @jansel in #2009
[cutedsl] Improve dot with epilogue handling by @jansel in #2014
[cutedsl] Plan grouped-N matmuls and lower atomic tensor indices by @jansel in #2020
[Pallas] Exclude output-only tensors from Pallas pallas_call inputs to improve performance by @norx1991 in #1849
[Pallas] Use FakeTensorMode to avoid HBM allocation for output-only tensors by @norx1991 in #2022
Add simplified se_block kernel (#989) by @mengluy0125 in #989
[Pallas] Fix symbolic offset codegen in TileIndexWithOffsetPattern by @norx1991 in #2068
[Pallas] Replace FakeTensorMode wrap with device='meta' for output-only tensors by @norx1991 in #2071
[TPU][Pallas] Enable TPU support and fix benchmarking for AOT compilation example by @yarongmu-google in #2059
fix misleading benchmarking for fp8 gemm by @shunting314 in #1980
add flashinfer allreduce-rmsnorm kernel by @shunting314 in #2063
[Autotuner] Make the autotuner robust to InvalidConfig by @bringlein in #2039
Deploy new perf dashboard to GitHub Pages by @choijon5 in #2066
Add backend-agnostic lane loop APIs to tile strategies by @aditvenk in #1798
[Pallas] Process tensor access within external lambdas when adjusting block size constraints by @AmesingFlank in #2073
[Pallas] Fix emit_pipeline/fori_loop codegen when multiple inner loops tile the same dim by @norx1991 in #2075
use non-interleaved benchmarking for all-reduce-rmsnorm by @shunting314 in #2065
Unify dashboard deployment with docs deploy by @choijon5 in #2082
[Pallas] Refactor memory space tracking into PallasMemorySpace enum by @norx1991 in #2072
Dashboard: restrict Overview/Speedup to main branch, track latency, UI polish by @choijon5 in #2084
[Pallas] Treat tile.id subscripts as untileable scalar indices by @norx1991 in #2083
[Pallas] Validate pallas_loop_type by @thcmbs in #2055
[NFC] [Pallas] Move indexing codegen helpers to pallas/codegen.py by @thcmbs in #2067
[Autotuner] Reland LLM-seeded hybrid search (originally #2004) by @choijon5 in #2091
Dashboard: fix crash-masking in CI status, split failures (accuracy/run/infra), chart polish by @choijon5 in #2086
[Pallas] More robust analysis of tensor reads/writes via FX graph instead of AST, allowing more aggressive output_only optimizations by @AmesingFlank in #2088
[Autotuner] Skip temperature for claude-opus-4-7 (HTTP 400) by @choijon5 in #2089
[Autotuner] Raise LLM response token budget to fit verbose configs by @choijon5 in #2090
Dashboard: dedupe MI325X duplicate entries, rename Runner Failures → No Result with last-seen date by @choijon5 in #2094
Dashboard: include cancelled runs so partial artifacts aren't lost to 6h timeout by @choijon5 in #2095
Fix hardcoded CUDA device in jagged_dense_bmm example by @norx1991 in #2080
[CI] fix 13.2 TileIR ci pipeline by @qelk123 in #2076
allow reuse variables across different static loops by @shunting314 in #2081
[Pallas] Fix SMEM/VMEM conflict for tensors with mixed access patterns by @norx1991 in #2069
Removing MI325 runners from CI by @umechand-amd in #2096
Benchmark: Reduce running Tritonbench for each kernel from twice to once. by @choijon5 in #2097
[ROCm] Improve ROCm compatibility for distributed kernels and expand ROCm test coverage in distributed test suites by @umechand-amd in #2049
[dashboard] temporarily remove --existing-url to rebuild cache by @choijon5 in #2109
Pin tritonbench commit in a file. by @umechand-amd in #2106
[dashboard] re-enable --existing-url and filter platforms from dispatch workflow by @choijon5 in #2110
Fix flaky distributed and torch.compile tests by @choijon5 in #2098
[Benchmarking/CI] Prevent hangs in benchmark phase via subprocess + per-config run timeout by @choijon5 in #2111
[Autotuner] Seed LFBO surrogate with stage-1 LLM benchmarks in hybrid search by @choijon5 in #2113
[dashboard] Fix empty dashboard caused by broken GitHub API query filters by @choijon5 in #2118
[Pallas] Host-side padding for non-divisible pl.ds() dimensions by @norx1991 in #2104
Update tritonbench.txt by @umechand-amd in #2121
Fix flaky test due to spurious NaN in fusion autotune accuracy check by @choijon5 in #2115
[Autotuner] Skip subprocess sticky CUDA errors instead of aborting autotune by @choijon5 in #2122
[Pallas] Adding a tunable pre_broadcast optimization pass for TPU scratch buffers, improving TPU attention perf by @AmesingFlank in #2103
[dashboard] Trigger docs-deploy explicitly from benchmark dispatch by @choijon5 in #2129
[Pallas] Fix multi-dim padding overwriting original tensor reference by @norx1991 in #2120
[Pallas] Extend padding to fori_loop DMA and emit_pipeline via _record_pad_info by @norx1991 in #2105
[Autotuner] Handle PyTorch CUDA OOM as a skippable error instead of aborting autotuning by @ethche in #2130
[Pallas] Indirect gather with pluggable strategies by @thcmbs in #2054
[lint] upgrade to pyrefly 0.63.1 by @oulgen in #2134
[lint] upgrade ruff to 0.15.12 by @oulgen in #2135
[Pallas] Add pl.multiple_of alignment hint to pl.ds() offsets by @norx1991 in #2116
Reduce measure() overhead when compile-time tracking is disabled by @gmagogsfm in #2139
[CI] Enable build and test for XPU by @chuanqi129 in #1327
[Autotuner] Catch expected errors during fork precompiler setup instead of aborting by @ethche in #2142
Optimize cache key computation overhead by @gmagogsfm in #2144
[Pallas] Honor _smem_arg_indices in pipeline launchers by @norx1991 in #2143
Update the tritonbench commit to current ToT, where the segfault fix has landed by @umechand-amd in #2149
[Benchmarking] Disable cudagraph for layer_norm-bwd / rms_norm-bwd by @choijon5 in #2127
[docs] Add dashboard link, LLM autotuner docs, remove past events by @choijon5 in #2151
[Pallas] Per-dim VMEM accounting for gather budget check by @thcmbs in #2137
[cute] Enable TestControlFlow by @oulgen in #2136
[cute] Add codegen for hl.split, hl.join, and aten.view.dtype; enable test_views.py by @oulgen in #2138
[Pallas] Rename pallas_loop_type "default" to "unroll" by @norx1991 in #2155
Remove redundant m_i update in example attention kernel by @AmesingFlank in #2156
Use integer arithmetic instead of triton.cdiv in launcher by @gmagogsfm in #2146
unbreak docs build by @oulgen in #2162
[Pallas] Use torch.addmm in matmul_layernorm K-loop by @norx1991 in #2141
[cute] Enable bunch of test suites by @oulgen in #2159
Minor runpod updates by @jansel in #2163
Update AGENTS.md by @jansel in #2164
Add cute-verify skill by @jansel in #2165
Add scripts/autoreview.py by @jansel in #2166
Run codespell from ./lint.sh by @jansel in #2175
[cutedsl] Matmul performance prework by @jansel in #2167
Small attention optimization: pre-scale q tile with qk_scale by @AmesingFlank in #2157
[Pallas] Tighten _check_dma_alignment + make "unroll" tests explicit by @norx1991 in #2158
[cute] Bump minimum cute version to 4.5 by @oulgen in #2180
Fix compile time measurements by @choijon5 in #2188
torch_tpu: update pin to 28d941aec27 by @cota in #1895
Remove redundant compile time env by @choijon5 in #2191
[Pallas] Emit offset/indices at inner-loop body prologue by @norx1991 in #2181
[Autotuner] Fix crash when autotuner_min exceeds max_size by @stmcgovern in #2177
Fix attention benchmark accuracy by @choijon5 in #2178
[Pallas] Enable se_block tests on TPU + simplify skipIfCudaCapabilityLessThan by @norx1991 in #2131
[cute] Implement topk and sort by @oulgen in #2160
Fix negative shift by @oulgen in #2185
[flaky test] skip register cache test on XPU by @choijon5 in #2192
Add fix-pr skill by @jansel in #2189
Add offsets kwarg to hl.rand for explicit Philox offsets by @karthickai in #2153
[Pallas] Per-tensor pipelining decision in fori_loop and emit_pipeline by @norx1991 in #2093
[cute] Implement associative scan by @oulgen in #2161
Add helion.from_cache() for FiniteSearch warm-start by @fulvius31 in #2079
[cute] Enable test_print by @oulgen in #2186
[BoundKernel] added _normalize_config by @hinriksnaer in #2152
[cutedsl] Improve CuTe tcgen05 matmul autotuning and direct-store epilogues by @jansel in #2168
[cutedsl] Compile CuTe launchers once and harden regression coverage by @jansel in #2169
[cutedsl] track tcgen05 per-tile setup and register split by @jansel in #2170
[cutedsl] add autotune wall-time budget by @jansel in #2171
[cutedsl] split tcgen05 persistent post-loop cleanup by @jansel in #2172
[cutedsl] prune dead tcgen05 role scaffolding by @jansel in #2173
[cutedsl] simplify tcgen05 layout plan by @jansel in #2174
[cutedsl] guard tcgen05 persistent multi-tile at runtime by @jansel in #2193
Fix flash attention benchmark CI by @choijon5 in #2207
[cute] enable test unroll tuples by @oulgen in #2187
[HostFunction] Extract _parse_source from Hostfunction.__init__ by @hinriksnaer in #2154
Dashboard: latency-as-default-graph, noise muting, platform sync, color fixes by @choijon5 in #2208
[cutedsl] split tcgen05 persistent setup into layout + prelude + tile body by @jansel in #2194
[cutedsl] add tcgen05 persistent role-block scaffolding + TMA-load tagging by @jansel in #2195
[cutedsl] split tcgen05 per-K-iter TMA producer/consumer block by @jansel in #2196
[cutedsl] recurse partitioner into K-loop body for tcgen05 TMA producer by @jansel in #2197
[cutedsl] split tcgen05 per-K-iter TMA builders into named helpers by @jansel in #2198
[cutedsl] split tcgen05 initial-prefetch IF emission into AST helper by @jansel in #2199
[cutedsl] dedupe tcgen05 codegen test mocks via _testing helpers by @jansel in #2200
[cutedsl] extract tcgen05 multi-tile guard var/message into class constants by @jansel in #2201
[cutedsl] consolidate cute reduction branches and tcgen05 autotune narrowing by @jansel in #2202
[cutedsl] extract _count_rdim_axes_in_val helper in roll_reduction by @jansel in #2203
[cutedsl] narrow tcgen05_num_epi_warps autotune to (4,) to avoid wrong output by @jansel in #2204
[cutedsl] reject tcgen05_num_epi_warps != 4 at codegen + diagnose root cause by @jansel in #2205
[cutedsl] add tcgen05 role-local-while builder infrastructure (3b-prep-4) by @jansel in #2206
Add missing onlyBackends([cute]) by @jansel in #2219
[cute] Enable test_indexing by @oulgen in #2210
[cute] Add basic autotuning capabilities by @oulgen in #2221
[Pallas] Use exprs from AST instead of SymPy exprs when generating loop bounds by @AmesingFlank in #2211
[Pallas] When there are data-dependent loop bounds, also use fori_loop instead of unroll by @AmesingFlank in #2212
[Pallas] Fix out-of-bound DMA caused by tiles from non-zero begins by @AmesingFlank in #2213
[Pallas] Apply tile masks at load time to zero out-of-bounds data by @AmesingFlank in #2214
Temporarily disable XPU CI by @jansel in #2270
Support HELION_AUTOTUNE_EFFORT=none HELION_FORCE_AUTOTUNE=1 by @oulgen in #2265
Retry more flakey job types by @jansel in #2269
[cutedsl] split persistent tma producer role by @jansel in #2224
[cutedsl] split persistent mma exec role by @jansel in #2225
[cutedsl] split persistent epi role by @jansel in #2226
[cutedsl] validate tcgen05 persistent z-grid by @jansel in #2227
[AOT] Suppress heuristic cache hit messages by default by @choijon5 in #2220
[Pallas] Fix DMA scratch buffer offset bug in nested fori_loop codegen by @AmesingFlank in #2217
[Pallas] Fix scratch Ref scoping bug in fori_loop/emit_pipeline codegen by @AmesingFlank in #2215
[cutedsl] restore scoped persistent pid autotune by @jansel in #2228
[cutedsl] Add flat tcgen05 TMA store epilogue by @jansel in #2229
[cutedsl] Add persistent tcgen05 TMA store epilogue by @jansel in #2230
[cutedsl] Close out tcgen05 TMA store acquire ordering by @jansel in #2231
[cutedsl] add guarded CtaGroup.TWO structural codegen by @jansel in #2232
[cutedsl] align two-CTA AB pipeline ownership by @jansel in #2233
[cutedsl] align two-CTA TMEM setup ordering by @jansel in #2234
[cutedsl] align two-CTA scheduler publication by @jansel in #2235
[cutedsl] advance guarded two-CTA role-local codegen by @jansel in #2236
[cutedsl] omit two-cta shared scheduler loop by @jansel in #2237
[cutedsl] Guard tcgen05 omit-shared scalar setup by @jansel in #2238
[cutedsl] Adjust two-CTA TMEM teardown by @jansel in #2239
[cutedsl] split tma tail capability check by @jansel in #2240
Fix tensor descriptor silent fallback for scalar SymInt subscripts by @ethche in #2222
[cutedsl] defer two-cta pipeline constructor sync by @jansel in #2241
[cutedsl] validate single-tile two-cta runtime by @jansel in #2242
[cutedsl] admit non-recycling two-cta tiles by @jansel in #2243
[cutedsl] validate shallow-k two-cta direct grid by @jansel in #2244
[cutedsl] validate long-k two-cta direct grid by @jansel in #2245
[cutedsl] enable CtaGroup.TWO TMA-store epilogue by @jansel in #2246
[cutedsl] elide CtaGroup.TWO role schedulers by @jansel in #2247
[cutedsl] restore two-cta scheduler recycling by @jansel in #2248
Skip grid td xpu by @ethche in #2275
[cutedsl] re-enable two-cta autotune search by @jansel in #2249
[cutedsl] seed two-cta autotune search by @jansel in #2250
[cutedsl] prune two-cta autotune failures by @jansel in #2251
[cutedsl] seed two-cta l2 grouping by @jansel in #2252
[cutedsl] seed two-cta tensor indexing by @jansel in #2253
[cutedsl] trim tcgen05 epilogue barrier by @jansel in #2255
[cutedsl] add two-cta pdl markers by @jansel in #2258
Temporarily disable pallas CI until upstream torch_tpu is fixed by @jansel in #2281
TPU CI: Use PyTorch nightly from 20260502 instead of most recent nightly to unblock CI by @AmesingFlank in #2298
[Pallas] Cast bool masks to float before expanding in _mask_to codegen by @AmesingFlank in #2216
[Pallas] Skip fp32 fallback for unary transcendentals on TPU by @norx1991 in #2268
[Pallas] Add xfail tests for BMM with non-zero K begin by @norx1991 in #2271
[Pallas] Fix pre-broadcasting transformation bug when non-broadcast dims exceed PRE_BROADCAST_SIZE by @AmesingFlank in #2223
[Pallas] Lower hl.zeros / hl.full to plain jnp.full by @norx1991 in #2278
[Pallas] Fix failing scratch shapes asserts due to land-time race when #2278 caused scratch shapes to be re-ordered by @AmesingFlank in #2302
[language] Add hl.rand4x for 4-output Philox RNG by @karthickai in #2283
Fix failing cutlass lints by @AmesingFlank in #2303
[xpu] Disable proton build for XPU by @Stonepia in #2300
[Pallas] Use dot_general instead of matmul for Pallas codegen by @AmesingFlank in #2299
[Autotuner] Enable autotuner seed configs by @ethche in #2276
update torch_tpu pin to a1ef0dd7fa2ffb730995e31953d1b5d316226c96 by @cota in #2316
TPU CI: Restore to using latest nightly pytorch by @AmesingFlank in #2320
[Pallas] Use jax_export_ignore_forward_compatibility=True when exporting JaxCallable, improving attention perf by @AmesingFlank in #2323
[Pallas] Make pallas_pre_broadcast a tunable autotune fragment by @norx1991 in #2324
Fix cute CI failures by @jansel in #2325
[cute] Pin to official 4.5.0 by @oulgen in #2326
[Pallas] Integrate TPU benchmarks into Benchmark Dispatch + dashboard by @norx1991 in #1913
remove obsolete _init_tpu_device helper by @thcmbs in #2319
Enable cute for error tests by @oulgen in #2339
[cutedsl] fix race in scalar store with full-slice subscript by @jansel in #2327
[cutedsl] reorder and hoist tcgen05 C-store epilogue by @jansel in #2328
[cutedsl] prefetch tcgen05 consumer token, reset UMMA accumulate per tile, seed pid order by @jansel in #2329
[cutedsl] add tcgen05 C-store / acc-wait / skip-UMMA / cubin-lineinfo diagnostic knobs by @jansel in #2330
[cutedsl] add tcgen05 split-first and store-tail T2R epilogue diagnostics by @jansel in #2331
[cutedsl] add tcgen05 module-helper T2R epilogue diagnostics by @jansel in #2332
[cutedsl] add tcgen05 role-local bridge codegen and tests by @jansel in #2333
[cutedsl] add tcgen05 role-local bridge pipeline, TMA mask, and ownership by @jansel in #2334
[cutedsl] add tcgen05 bridge AB acc-advance and acquire diagnostics by @jansel in #2335
[cutedsl] add tcgen05 bridge AB wait, phase, and initial-acquire diagnostics by @jansel in #2336
[cutedsl] add tcgen05 larger-BN codegen test by @jansel in #2337
Enable cute for int64 indexing tests by @oulgen in #2340
Enable cute for cache tests by @oulgen in #2341
Enable cute for autotune tests by @oulgen in #2342
Add reduction support to helion autodiff by @karthickai in #1747
Dashboard: only nightly runs populate Overview; manual dispatches stay in Compare by @choijon5 in #2355
[Pallas] Don't pipeline tensors read or written outside the inner loop by @norx1991 in #2284
[tutorials] pretuned Helion examples by @choijon5 in #2209
[docs] Add AOT autotuning documentation by @choijon5 in #2274
Enable cute for stack tensor tests by @oulgen in #2349
Enable cute for jagged tile tests by @oulgen in #2350
Enable cute for epilogue subtiling tests by @oulgen in #2352
[Pallas] Align TPU kernel names with GPU dashboard + add compile-time measurement by @norx1991 in #2354
[runtime:pallas] consolidate pallas_aliases computation in _pallas_prepare_args by @cota in #2348
[docs] Fix broken tutorials/ links in AOT autotuning doc by @norx1991 in #2358
Enable tensor_descriptor for static_shapes = False by @ethche in #2356
[compile] Introduce KernelCompiler for pipeline orchestration by @hinriksnaer in #2267
[Pallas] Default pallas_loop_type to emit_pipeline by @norx1991 in #2321
ci: bump TPU jax/jaxlib pin to 0.10.0 by @thcmbs in #2361
Dashboard: fix nightly detection and restrict Overview to main branch by @choijon5 in #2363
Dashboard: add geomean footer rows to Speedup and Compare tables by @choijon5 in #2364
[compile] Add create_reduction_strategy() and adjust_reduction_thread_count to Backend by @hinriksnaer in #2318
Dashboard: don't flag manual-only kernels as infra_missing by @norx1991 in #2366
Support lowering torch.bmm(..., dtype=), use it for attention to avoid redundant fp32 -> fp16 -> fp32 roundtrip by @AmesingFlank in #2365
[cute] Generalize synthetic lane loops and loop-carried accumulator checks by @hinriksnaer in #2347
[Pallas] Make per-tile element cap a backend hook, disable for Pallas by @norx1991 in #2282
[Pallas] Add larger shapes for attention and matmul_layernorm benchmarks by @norx1991 in #2368
Docs: merge AOT Autotuning into Deployment guide; list pretuned kernels under examples. by @choijon5 in #2369
Relax RMS pretuned kernel perf wins gate by @choijon5 in #2371
[Benchmarking] Default to benchmark subprocess by @choijon5 in #2372
[Autotuner] Clear Triton JIT fast-path caches after benchmarks by @ethche in #2367
Support dynamic TD guards for container tensors by @ethche in #2370
Increase effort for autoreview by @jansel in #2375
[Dashboard] Use paired geomeans in comparisons by @choijon5 in #2376
update torch_tpu pin to 104763049fe1df6834605fed1cd2b79434ea02d5 by @cota in #2362
Use HALF_DTYPE in epilogue subtiling example by @thcmbs in #2343
Relax test_pretuned_kernels.py targets by @jansel in #2379
[pallas] remove remaining torch_tpu.api usage by @cota in #2345
rms_norm: save inv_rms in fp32 to fix bf16 backward by @thcmbs in #2273
[cutedsl] Add Tcgen05Strategy/WarpSpec data model and config keys by @jansel in #2380
[cutedsl] Drive matmul warp roles from generated warp-spec records by @jansel in #2381
[cutedsl] Implement ROLE_LOCAL_WITH_SCHEDULER matmul strategy by @jansel in #2382
[cutedsl] Pin output dtype on matmul plan for epilogue tile shape by @jansel in #2383
[cutedsl] Add CLC-persistent tile scheduler for matmul by @jansel in #2384
[cutedsl] Support cluster_n=2 in matmul lowering by @jansel in #2385
[cutedsl] Add unary epilogue chain analyzer and splicing by @jansel in #2386
[cutedsl] Fuse auxiliary tensor loads in epilogue chains by @jansel in #2387
[cutedsl] Enable cluster_n=2 under role-local scheduler by @jansel in #2388
[cutedsl] Gate cluster_m=2 search by wave quantization and broaden epi-fusion shapes by @jansel in #2389
[cutedsl] Add A/B SMEM and L2 scheduler swizzle controls by @jansel in #2390
Add _gelu_tanh_approx op by @jansel in #2391
[Autotuner] Confirm suspicious subprocess timings by @choijon5 in #2377
Attention Perf: Multiply Q in-loop to avoid memory spillage by @AmesingFlank in #2373
Attention Perf: Transpose blocked K right before QK instead of pre-transposing before the kernel by @AmesingFlank in #2374
[compiler][autotuner] Autotuner heuristics by @ethche in #2392
[compile] Remove unused device_load_count by @hinriksnaer in #2395
[compile] Eliminate two-phase initialization of HostFunction by @hinriksnaer in #2396
Avoid dynamic shape recompiles for 0/1 tensor dimensions by @oulgen in #2353
Add Claude Code workflow by @choijon5 in #2412
[XPU] Disable torch.compile fusion to unblock Inductor range-symbol failures by @karthickai in #2413
Skip unsupported fbcode and MTIA RNG tests by @choijon5 in #2414
[cutedsl] gate tcgen05_ab_stages=3 search behind per-CTA SMEM budget by @jansel in #2400
[cutedsl] add autotune sweep harness for CuTe examples by @jansel in #2401
[cutedsl] fix universal-MMA lane-loop and grid codegen guards by @jansel in #2402
[autotuner] make initial-population benchmark phase budget-aware by @jansel in #2403
[cutedsl] predicate atomic ops on CTA-resident ghost axes by @jansel in #2404
[cutedsl] widen c_input_warps and loop_orders autotune surface by @jansel in #2405
[cutedsl] discover aux-tensor descriptors at tcgen05 MMA codegen by @jansel in #2406
[cutedsl] emit c-input warp role-local while for residual scheduling by @jansel in #2407
[cutedsl] allocate c-input warp aux pipeline without producer barrier ops by @jansel in #2408
[cutedsl] fuse c-input warp aux prefetch into smem ring by @jansel in #2409
[cutedsl] widen autotune surface for c_input warp on residual kernels by @jansel in #2410
[cutedsl] fix c_input warp aux tile coords for cluster + l2_groupings by @jansel in #2411
RoPE kernel by @ethche in #2415
Add RoPE to nightly GPU benchmarks by @choijon5 in #2419
Add Mamba2 and GDN to nightly GPU benchmarks by @choijon5 in #2424
[compile] Add DeviceFunction.resolved_block_size(block_id) helper by @hinriksnaer in #2418
Fix perf dashboard for newly added examples by @choijon5 in #2434
Add RemoteCacheBackend ABC for pluggable remote autotune caching by @fulvius31 in #2317
Add CuTe NVFP4 GEMV example by @oulgen in #2433
[test/cute] use env directly in _get_mma_k_loop_info for block size resolution by @hinriksnaer in #2444
Infer CuTe NVFP4 conversions from dtypes by @oulgen in #2437
torch_tpu: update pin to 157713848ac0a510eb3a057c550861d999d4ec93 by @cota in #2438
Fixing the stale global memory read for AMD GPUs. by @umechand-amd in #1845
Make Triton do_not_specialize opt-in by @choijon5 in #2426
[metal] eliminate trivial stride-1 multiplication in MSL codegen by @hinriksnaer in #2432
[metal] use array subscript syntax for MSL memory access by @hinriksnaer in #2441
[metal] format MSL signature with one parameter per line by @hinriksnaer in #2442
Fix B200 benchmark CI failures by @choijon5 in #2455
[Pallas] Add fused_linear_jsd and grpo_loss to TPU benchmark sweep by @norx1991 in #2421
[Pallas] Accept kernels_tpu='all' for full-coverage TPU bench by @norx1991 in #2459
Limit A10G CI pytest workers to two by @choijon5 in #2457
Enable cudagraph for running examples by @choijon5 in #2461
Skip pretuned kernel perf gating in fbcode by @choijon5 in #2463
Optimize nvfp4 CuTe perf paths by @oulgen in #2462
ci: declare workflow-level contents: read on 3 workflows by @arpitjain099 in #2460
[Autotuner] LLM search: effort_level knob + Anthropic adaptive thinking + OpenAI xhigh by @choijon5 in #2446
[Autotuner] LLM search: Anthropic Opus 4.6/4.7 fast mode by @choijon5 in #2450
[Autotuner] LLM search: fail loudly + mTLS gateway compatibility by @choijon5 in #2448
[Examples] rope: print benchmark table via run_example by @choijon5 in #2451
[Autotuner] LLM prompt: diversify num_stages/num_warps in seed batch by @choijon5 in #2465
Add H100 (sm90) pretuned heuristics and perf gates by @choijon5 in #2454
Add Cute benchmark by @oulgen in #2466
[cache] Wire from_best_available / from_cache to RemoteCacheBackend by @fulvius31 in #2453
[Pallas] Add epilogue_subtiling to TPU benchmark sweep by @norx1991 in #2458
Fix rope-bwd cudagraph crash and tighten gdn_fwd_h accuracy gate by @choijon5 in #2467
[Pallas] Slice store values to match clamped Pallas BlockSpec ref shape by @thcmbs in #2398
PR Summary: Support reductions under branch-by-grid control flow by @yushangdi in #2480
Restore TF32 backend state by @yushangdi in #2482
[Pallas] Add kl_div to TPU benchmark sweep and dashboard by @norx1991 in #2484
Fix nightly perf CI by @choijon5 in #2476
torch_tpu: update pin to 3fb6cdbd96180e69df2233db51089656b230e6b6 by @cota in #2474
[docs] Document remote autotune cache and warm-start behavior by @fulvius31 in #2475
[CI fix] Raise benchmark subprocess timeout to 90s in fbcode by @choijon5 in #2487
Factor-out a bit of common logic for finding return names of if and else branches by @AmesingFlank in #2486
[Pallas] Add FP8 dtype mappings to torch-to-JAX table by @thcmbs in #2489
[Pallas] sympy mod printer by @thcmbs in #2490
Fix keepdim scalar reduction reshape in Triton codegen by @yushangdi in #2483
[Pallas] Add test for fused_linear_jsd_fwd autograd path by @norx1991 in #2456
Only include common outputs as outputs of traced if subgraph by @AmesingFlank in #2485
[dashboard] Suppress 'No Result' for kernels removed from workflow defaults by @choijon5 in #2492
[Pallas] Add a helion setting for pallas interpret mode by @AmesingFlank in #2522
[Pallas] Render outer_prefix for emit_pipeline and fori_loop scopes by @norx1991 in #2496
[Pallas] Thread pallas_interpret through runtime launchers by @AmesingFlank in #2524
[Pallas] Disable factory padding and preserve concrete dims by @thcmbs in #2477
[Pallas] Support tile.index broadcast indexing in load codegen by @norx1991 in #2532
[Pallas] Make Pallas interpret mode honor TPU constraints by @norx1991 in #2525
[Pallas] Route meta output-only tensors to CPU under interpret by @norx1991 in #2526

New Contributors

@allgather made their first contribution in #1994
@svdrecbd made their first contribution in #1885
@Hamlin-Li made their first contribution in #1996
@yarongmu-google made their first contribution in #2024
@will-cromar made their first contribution in #2030
@Bodlux made their first contribution in #2036
@xiaohongchen1991 made their first contribution in #2042
@chuanqi129 made their first contribution in #1327
@stmcgovern made their first contribution in #2177
@arpitjain099 made their first contribution in #2460

Full Changelog: v1.0.0...v1.1.0
