v0.3.1
What's Changed
- [Autotuner] Use chunked comparison in autotuner accuracy checks to reduce peak memory by @yf225 in #1538
- one_shot_allreduce_bias_rmsnorm: alloc&rendezvous symm memory only once by @shunting314 in #1525
- Add Horace as Core Maintainer by @oulgen in #1552
- add two-shot all reduce rms norm kernel by @shunting314 in #1526
- Update bazel build command to use remote caching by @malfet in #1550
- Add BlockSpec support and proper indexing to Pallas launcher by @oulgen in #1548
- Add Pallas codegen for prims.iota by @oulgen in #1556
- [CI] Fix AMD error on test_default_block_sizes_high_dim_with_reduction by @yf225 in #1546
- [Autotuner] Eliminate self._original_args clone to reduce peak memory by @yf225 in #1547
- sync bench/rebench result across ranks by @shunting314 in #1542
- Skip failing test on MTIA by @Myrthan in #1558
- [Cache] Defer some initialization in autotuner to skip unnecessary work on cache hits by @fulvius31 in #1557
- Enable Pallas backend for test_control_flow with lax.cond by @oulgen in #1554
- Add Pallas atomic ops -- everything is atomic by default by @oulgen in #1560
- Add Pallas emit_pipeline by @oulgen in #1561
- Move do_bench/sync_object from _testing to autotuner/benchmarking to fix docs build by @oulgen in #1562
- Improve launcher overhead by @oulgen in #1563
- fix mock in debug utils - hopefully make ci less flaky by @v0i0 in #1568
- fix fp8 autotuner fail for ci by @v0i0 in #1569
- fix some indexing by @v0i0 in #1453
- chore: Bump actions/upload-artifact from 6 to 7 by @dependabot[bot] in #1579
- chore: Bump actions/download-artifact from 7 to 8 by @dependabot[bot] in #1580
- Build TorchTPU in opt mode by @oulgen in #1573
- Add Pallas fori_loop with async DMA and pallas_loop_type config by @oulgen in #1574
- Enable waves_per_eu tunable on RDNA GPUs by @fulvius31 in #1571
- [cutedsl] Cross-warp reductions by @jansel in #1543
- sync seeds across ranks by @shunting314 in #1555
- [cutedsl] Add more reduction types by @jansel in #1565
- [cutedsl] Tuple reductions by @jansel in #1566
- Fix CPU tests by @jansel in #1582
- Add Pallas reduction support and enable test_reductions for TPU by @oulgen in #1549
- Tell AGENTS to not run git push by @oulgen in #1592
- partition aot tuning by config validity by @v0i0 in #1581
- bump pytorch to 2.10 in benchmark nightly to unblock grouped_gemm tritonbench by @v0i0 in #1597
- fix benchmark parsing by @v0i0 in #1596
- make sure persistent setup code uses index type for total pids by @v0i0 in #1595
- Fix issue with branching and static_range by @jansel in #1586
- Update AGENTS.md assertExpectedJournal reference by @jansel in #1587
- [cutedsl] Enable more test files by @jansel in #1583
- [cutedsl] Fix issue with sympy printer by @jansel in #1585
- [Cache] Fix cross-backend cache poisoning by adding backend to cache key by @fulvius31 in #1593
- [Helion + torch.compile] Add additional inductor fusion tests by @yf225 in #1594
- [Autotuner] Add FROM_BEST_AVAILABLE initial population strategy by @fulvius31 in #1365
- Unwrap single-element list in hl.tile() to match scalar behavior by @blake-snc in #1559
- Enable test_examples by @oulgen in #1598
- Generalize autotuning infrastructure to support Pallas/TPU backend by @oulgen in #1591
- Fix symbolic variable specialization when indexing tensor in host block by @yf225 in #1575
- Add CLAUDE.md as symlink to AGENTS.md by @oulgen in #1610
- update grouped gemm tb signature by @v0i0 in #1608
- Update logo and add events for hackathon and PLDI tutorial by @choijon5 in #1611
- Add claude skill for TPU development by @oulgen in #1612
- Enable test random and rnd by @oulgen in #1607
- [Helion + torch.compile] Temporarily skip torch.compile fusion test cases to wait for cross-repo changes by @yf225 in #1613
- add ci health check (hopefully catches cudaDeviceUnavailable) by @v0i0 in #1614
- [Helion + torch.compile] Skip test_symint_return_from_tensor_shape temporarily by @yf225 in #1615
- update rocm in benchmark now that we bumped pytorch by @v0i0 in #1609
- Remove Triton CPU backend and all CPU references by @oulgen in #1616
- [Internal CI] Use MTIA-aligned tensor shape in test_tile_single_element_list by @yf225 in #1618
- Refactor device_ir passes by @jansel in #1606
- Add @skipIfCudaSharedMemoryLessThan for failing RTX5090 tests by @jansel in #1617
- another benchmark parsing fix by @v0i0 in #1621
- Disable forking to fix failing test by @jansel in #1619
- Support for Advanced Control Files in Autotuner and Configs by @ptorru in #1576
- Fix scalar tensor indexing crash by @hinriksnaer in #1620
- [Docs] Update autotuner and TileIR backend documentation by @fulvius31 in #1600
- benchmarking all_gather_matmul by @shunting314 in #1605
- Pin H100 nightly and skip flakey pallas test by @jansel in #1624
- [CI] Relax tolerances in test_hl_arange_non_power_of_2 by @yf225 in #1627
- Decrease sizes for test_batch_softmax_block_ptr by @jansel in #1628
- fix dynamic shapes handling tensor descriptors conservatively by @v0i0 in #1604
- [Docs] Fix Helion repo link by @choijon5 in #1629
- Rename Helion Puzzles to Tutorials; fix broken tutorial examples; move to Markdown by @yf225 in #1631
- fix size hint lint from nightly by @v0i0 in #1630
- Introduce MTIA autotuning knobs by @kile01 in #1572
- change to gfx942.1 runner like pytorch, hopefully shorter queue by @v0i0 in #1632
- Set TRITON_STORE_BINARY_ONLY=1 during autotuning to reduce cache size by @fulvius31 in #1590
- Enable fast triton sigmoid by @hinriksnaer in #1564
- Fix fake impl inference with unbacked SymInts by @gmagogsfm in #1626
- [pallas-tpu] Add HALF_DTYPE constant for backend-portable half-precision tests by @v0i0 in #1636
- add gpu health check to tests as well by @v0i0 in #1634
- Properly generate floor division using // for constexpr by @PaulZhang12 in #1625
- Add --file-header option for custom headers in AOT generated files by @v0i0 in #1645
- [pallas-tpu] Bump libtpu to 0.0.37, jax/jaxlib to 0.9.1 by @v0i0 in #1641
- [pallas-tpu] Enable pallas backend for view/reshape tests in test_views.py by @v0i0 in #1638
- fix a10g lerp nightly error by @v0i0 in #1652
- [pallas-tpu] remove shape based block specs by @v0i0 in #1633
- Fix lerp decomp in _make_fx for PyTorch nightly by @v0i0 in #1653
- [pallas-tpu] Fix BlockSpec regressions from codegen BlockSpecs PR by @v0i0 in #1656
- aot tuning standalone output by @v0i0 in #1650
- [pallas-tpu] Enable pallas backend for test_loops.py TestLoops class by @v0i0 in #1639
- [pallas-tpu] Switch float16 to HALF_DTYPE in examples for TPU compatibility by @v0i0 in #1648
New Contributors
- @malfet made their first contribution in #1550
- @blake-snc made their first contribution in #1559
- @ptorru made their first contribution in #1576
- @kile01 made their first contribution in #1572
Full Changelog: v0.3.0...v0.3.1