Release v0.3.1 · pytorch/helion

What's Changed

[Autotuner] Use chunked comparison in autotuner accuracy checks to reduce peak memory by @yf225 in #1538
one_shot_allreduce_bias_rmsnorm: alloc&rendezvous symm memory only once by @shunting314 in #1525
Add Horace as Core Maintainer by @oulgen in #1552
add two-shot all reduce rms norm kernel by @shunting314 in #1526
Update bazel build command to use remote caching by @malfet in #1550
Add BlockSpec support and proper indexing to Pallas launcher by @oulgen in #1548
Add Pallas codegen for prims.iota by @oulgen in #1556
[CI] Fix AMD error on test_default_block_sizes_high_dim_with_reduction by @yf225 in #1546
[Autotuner] Eliminate self._original_args clone to reduce peak memory by @yf225 in #1547
sync bench/rebench result across ranks by @shunting314 in #1542
Skip failing test on MTIA by @Myrthan in #1558
[Cache] Defer some initialization in autotuner to skip unnecessary work on cache hits by @fulvius31 in #1557
Enable Pallas backend for test_control_flow with lax.cond by @oulgen in #1554
Add Pallas atomic ops -- everything is atomic by default by @oulgen in #1560
Add Pallas emit_pipeline by @oulgen in #1561
Move do_bench/sync_object from _testing to autotuner/benchmarking to fix docs build by @oulgen in #1562
Improve launcher overhead by @oulgen in #1563
fix mock in debug utils - hopefully make ci less flaky by @v0i0 in #1568
fix fp8 autotuner fail for ci by @v0i0 in #1569
fix some indexing by @v0i0 in #1453
chore: Bump actions/upload-artifact from 6 to 7 by @dependabot[bot] in #1579
chore: Bump actions/download-artifact from 7 to 8 by @dependabot[bot] in #1580
Build TorchTPU in opt mode by @oulgen in #1573
Add Pallas fori_loop with async DMA and pallas_loop_type config by @oulgen in #1574
Enable waves_per_eu tunable on RDNA GPUs by @fulvius31 in #1571
[cutedsl] Cross-warp reductions by @jansel in #1543
sync seeds across ranks by @shunting314 in #1555
[cutedsl] Add more reduction types by @jansel in #1565
[cutedsl] Tuple reductions by @jansel in #1566
Fix CPU tests by @jansel in #1582
Add Pallas reduction support and enable test_reductions for TPU by @oulgen in #1549
Tell AGENTS to not run git push by @oulgen in #1592
partition aot tuning by config validity by @v0i0 in #1581
bump pytorch to 2.10 in benchmark nightly to unblock grouped_gemm tritonbench by @v0i0 in #1597
fix benchmark parsing by @v0i0 in #1596
make sure persistent setup code uses index type for total pids by @v0i0 in #1595
Fix issue with branching and static_range by @jansel in #1586
Update AGENTS.md assertExpectedJournal reference by @jansel in #1587
[cutedsl] Enable more test files by @jansel in #1583
[cutedsl] Fix issue with sympy printer by @jansel in #1585
[Cache] Fix cross-backend cache poisoning by adding backend to cache key by @fulvius31 in #1593
[Helion + torch.compile] Add additional inductor fusion tests by @yf225 in #1594
[Autotuner] Add FROM_BEST_AVAILABLE initial population strategy by @fulvius31 in #1365
Unwrap single-element list in hl.tile() to match scalar behavior by @blake-snc in #1559
Enable test_examples by @oulgen in #1598
Generalize autotuning infrastructure to support Pallas/TPU backend by @oulgen in #1591
Fix symbolic variable specialization when indexing tensor in host block by @yf225 in #1575
Add CLAUDE.md as symlink to AGENTS.md by @oulgen in #1610
update grouped gemm tb signature by @v0i0 in #1608
Update logo and add events for hackathon and PLDI tutorial by @choijon5 in #1611
Add claude skill for TPU development by @oulgen in #1612
Enable test random and rnd by @oulgen in #1607
[Helion + torch.compile] Temporarily skip torch.compile fusion test cases to wait for cross-repo changes by @yf225 in #1613
add ci health check (hopefully catches cudaDeviceUnavailable) by @v0i0 in #1614
[Helion + torch.compile] Skip test_symint_return_from_tensor_shape temporarily by @yf225 in #1615
update rocm in benchmark now that we bumped pytorch by @v0i0 in #1609
Remove Triton CPU backend and all CPU references by @oulgen in #1616
[Internal CI] Use MTIA-aligned tensor shape in test_tile_single_element_list by @yf225 in #1618
Refactor device_ir passes by @jansel in #1606
Add @skipIfCudaSharedMemoryLessThan for failing RTX5090 tests by @jansel in #1617
another benchmark parsing fix by @v0i0 in #1621
Disable forking to fix failing test by @jansel in #1619
Support for Advanced Control Files in Autotuner and Configs by @ptorru in #1576
Fix scalar tensor indexing crash by @hinriksnaer in #1620
[Docs] Update autotuner and TileIR backend documentation by @fulvius31 in #1600
benchmarking all_gather_matmul by @shunting314 in #1605
Pin H100 nightly and skip flaky pallas test by @jansel in #1624
[CI] Relax tolerances in test_hl_arange_non_power_of_2 by @yf225 in #1627
Decrease sizes for test_batch_softmax_block_ptr by @jansel in #1628
fix dynamic shapes handling tensor descriptors conservatively by @v0i0 in #1604
[Docs] Fix Helion repo link by @choijon5 in #1629
Rename Helion Puzzles to Tutorials; fix broken tutorial examples; move to Markdown by @yf225 in #1631
fix size hint lint from nightly by @v0i0 in #1630
Introduce MTIA autotuning knobs by @kile01 in #1572
change to gfx942.1 runner like pytorch, hopefully shorter queue by @v0i0 in #1632
Set TRITON_STORE_BINARY_ONLY=1 during autotuning to reduce cache size by @fulvius31 in #1590
Enable fast triton sigmoid by @hinriksnaer in #1564
Fix fake impl inference with unbacked SymInts by @gmagogsfm in #1626
[pallas-tpu] Add HALF_DTYPE constant for backend-portable half-precision tests by @v0i0 in #1636
add gpu health check to tests as well by @v0i0 in #1634
Properly generate floor division using // for constexpr by @PaulZhang12 in #1625
Add --file-header option for custom headers in AOT generated files by @v0i0 in #1645
[pallas-tpu] Bump libtpu to 0.0.37, jax/jaxlib to 0.9.1 by @v0i0 in #1641
[pallas-tpu] Enable pallas backend for view/reshape tests in test_views.py by @v0i0 in #1638
fix a10g lerp nightly error by @v0i0 in #1652
[pallas-tpu] remove shape based block specs by @v0i0 in #1633
Fix lerp decomp in _make_fx for PyTorch nightly by @v0i0 in #1653
[pallas-tpu] Fix BlockSpec regressions from codegen BlockSpecs PR by @v0i0 in #1656
aot tuning standalone output by @v0i0 in #1650
[pallas-tpu] Enable pallas backend for test_loops.py TestLoops class by @v0i0 in #1639
[pallas-tpu] Switch float16 to HALF_DTYPE in examples for TPU compatibility by @v0i0 in #1648

New Contributors

@malfet made their first contribution in #1550
@blake-snc made their first contribution in #1559
@ptorru made their first contribution in #1576
@kile01 made their first contribution in #1572

Full Changelog: v0.3.0...v0.3.1
