v0.3.1

@oulgen released this 12 Mar 21:23
· 751 commits to main since this release
4c4b33b

What's Changed

  • [Autotuner] Use chunked comparison in autotuner accuracy checks to reduce peak memory by @yf225 in #1538
  • one_shot_allreduce_bias_rmsnorm: alloc&rendezvous symm memory only once by @shunting314 in #1525
  • Add Horace as Core Maintainer by @oulgen in #1552
  • add two-shot all reduce rms norm kernel by @shunting314 in #1526
  • Update bazel build command to use remote caching by @malfet in #1550
  • Add BlockSpec support and proper indexing to Pallas launcher by @oulgen in #1548
  • Add Pallas codegen for prims.iota by @oulgen in #1556
  • [CI] Fix AMD error on test_default_block_sizes_high_dim_with_reduction by @yf225 in #1546
  • [Autotuner] Eliminate self._original_args clone to reduce peak memory by @yf225 in #1547
  • sync bench/rebench result across ranks by @shunting314 in #1542
  • Skip failing test on MTIA by @Myrthan in #1558
  • [Cache] Defer some initialization in autotuner to skip unnecessary work on cache hits by @fulvius31 in #1557
  • Enable Pallas backend for test_control_flow with lax.cond by @oulgen in #1554
  • Add Pallas atomic ops -- everything is atomic by default by @oulgen in #1560
  • Add Pallas emit_pipeline by @oulgen in #1561
  • Move do_bench/sync_object from _testing to autotuner/benchmarking to fix docs build by @oulgen in #1562
  • Improve launcher overhead by @oulgen in #1563
  • fix mock in debug utils - hopefully make ci less flaky by @v0i0 in #1568
  • fix fp8 autotuner fail for ci by @v0i0 in #1569
  • fix some indexing by @v0i0 in #1453
  • chore: Bump actions/upload-artifact from 6 to 7 by @dependabot[bot] in #1579
  • chore: Bump actions/download-artifact from 7 to 8 by @dependabot[bot] in #1580
  • Build TorchTPU in opt mode by @oulgen in #1573
  • Add Pallas fori_loop with async DMA and pallas_loop_type config by @oulgen in #1574
  • Enable waves_per_eu tunable on RDNA GPUs by @fulvius31 in #1571
  • [cutedsl] Cross-warp reductions by @jansel in #1543
  • sync seeds across ranks by @shunting314 in #1555
  • [cutedsl] Add more reduction types by @jansel in #1565
  • [cutedsl] Tuple reductions by @jansel in #1566
  • Fix CPU tests by @jansel in #1582
  • Add Pallas reduction support and enable test_reductions for TPU by @oulgen in #1549
  • Tell AGENTS to not run git push by @oulgen in #1592
  • partition aot tuning by config validity by @v0i0 in #1581
  • bump pytorch to 2.10 in benchmark nightly to unblock grouped_gemm tritonbench by @v0i0 in #1597
  • fix benchmark parsing by @v0i0 in #1596
  • make sure persistent setup code uses index type for total pids by @v0i0 in #1595
  • Fix issue with branching and static_range by @jansel in #1586
  • Update AGENTS.md assertExpectedJournal reference by @jansel in #1587
  • [cutedsl] Enable more test files by @jansel in #1583
  • [cutedsl] Fix issue with sympy printer by @jansel in #1585
  • [Cache] Fix cross-backend cache poisoning by adding backend to cache key by @fulvius31 in #1593
  • [Helion + torch.compile] Add additional inductor fusion tests by @yf225 in #1594
  • [Autotuner] Add FROM_BEST_AVAILABLE initial population strategy by @fulvius31 in #1365
  • Unwrap single-element list in hl.tile() to match scalar behavior by @blake-snc in #1559
  • Enable test_examples by @oulgen in #1598
  • Generalize autotuning infrastructure to support Pallas/TPU backend by @oulgen in #1591
  • Fix symbolic variable specialization when indexing tensor in host block by @yf225 in #1575
  • Add CLAUDE.md as symlink to AGENTS.md by @oulgen in #1610
  • update grouped gemm tb signature by @v0i0 in #1608
  • Update logo and add events for hackathon and PLDI tutorial by @choijon5 in #1611
  • Add claude skill for TPU development by @oulgen in #1612
  • Enable test random and rnd by @oulgen in #1607
  • [Helion + torch.compile] Temporarily skip torch.compile fusion test cases to wait for cross-repo changes by @yf225 in #1613
  • add ci health check (hopefully catches cudaDeviceUnavailable) by @v0i0 in #1614
  • [Helion + torch.compile] Skip test_symint_return_from_tensor_shape temporarily by @yf225 in #1615
  • update rocm in benchmark now that we bumped pytorch by @v0i0 in #1609
  • Remove Triton CPU backend and all CPU references by @oulgen in #1616
  • [Internal CI] Use MTIA-aligned tensor shape in test_tile_single_element_list by @yf225 in #1618
  • Refactor device_ir passes by @jansel in #1606
  • Add @skipIfCudaSharedMemoryLessThan for failing RTX5090 tests by @jansel in #1617
  • another benchmark parsing fix by @v0i0 in #1621
  • Disable forking to fix failing test by @jansel in #1619
  • Support for Advanced Control Files in Autotuner and Configs by @ptorru in #1576
  • Fix scalar tensor indexing crash by @hinriksnaer in #1620
  • [Docs] Update autotuner and TileIR backend documentation by @fulvius31 in #1600
  • benchmarking all_gather_matmul by @shunting314 in #1605
  • Pin H100 nightly and skip flaky pallas test by @jansel in #1624
  • [CI] Relax tolerances in test_hl_arange_non_power_of_2 by @yf225 in #1627
  • Decrease sizes for test_batch_softmax_block_ptr by @jansel in #1628
  • fix dynamic shapes handling tensor descriptors conservatively by @v0i0 in #1604
  • [Docs] Fix Helion repo link by @choijon5 in #1629
  • Rename Helion Puzzles to Tutorials; fix broken tutorial examples; move to Markdown by @yf225 in #1631
  • fix size hint lint from nightly by @v0i0 in #1630
  • Introduce MTIA autotuning knobs by @kile01 in #1572
  • change to gfx942.1 runner like pytorch, hopefully shorter queue by @v0i0 in #1632
  • Set TRITON_STORE_BINARY_ONLY=1 during autotuning to reduce cache size by @fulvius31 in #1590
  • Enable fast triton sigmoid by @hinriksnaer in #1564
  • Fix fake impl inference with unbacked SymInts by @gmagogsfm in #1626
  • [pallas-tpu] Add HALF_DTYPE constant for backend-portable half-precision tests by @v0i0 in #1636
  • add gpu health check to tests as well by @v0i0 in #1634
  • Properly generate floor division using // for constexpr by @PaulZhang12 in #1625
  • Add --file-header option for custom headers in AOT generated files by @v0i0 in #1645
  • [pallas-tpu] Bump libtpu to 0.0.37, jax/jaxlib to 0.9.1 by @v0i0 in #1641
  • [pallas-tpu] Enable pallas backend for view/reshape tests in test_views.py by @v0i0 in #1638
  • fix a10g lerp nightly error by @v0i0 in #1652
  • [pallas-tpu] remove shape based block specs by @v0i0 in #1633
  • Fix lerp decomp in _make_fx for PyTorch nightly by @v0i0 in #1653
  • [pallas-tpu] Fix BlockSpec regressions from codegen BlockSpecs PR by @v0i0 in #1656
  • aot tuning standalone output by @v0i0 in #1650
  • [pallas-tpu] Enable pallas backend for test_loops.py TestLoops class by @v0i0 in #1639
  • [pallas-tpu] Switch float16 to HALF_DTYPE in examples for TPU compatibility by @v0i0 in #1648

Full Changelog: v0.3.0...v0.3.1