v0.3.3

@oulgen released this 24 Mar 20:39
· 691 commits to main since this release
d4003c1

What's Changed

  • bump cublas for b200 benchmarks by @v0i0 in #1666
  • Remove need to pass device=... into torch functions by @hinriksnaer in #1657
  • [docs] Updated docs to communicate fast math support by @hinriksnaer in #1669
  • Fix #762: Replace internal assertion with user-facing error in set_pid by @tianrengao in #1670
  • [Helion + torch.compile] Add expected_num_kernels validation to torch.compile tests by @yf225 in #1655
  • [pallas-tpu] Fix segfault xfails: softmax tests now pass, reclassify others by @v0i0 in #1721
  • Reject data-dependent output shapes in infer_output_spec by @gmagogsfm in #1722
  • [pallas-tpu] Add test_long_sum_manual to verify range bound fix by @v0i0 in #1732
  • [pallas-tpu] fix default configs for TPU examples by @v0i0 in #1731
  • Move Triton-specific implementations from Backend base class to TritonBackend. by @norx1991 in #1728
  • Fix lint in ref_mode.py by @jansel in #1681
  • Add missing rebenchmark, finishing phase, and effort profile wiring to DESurrogateHybrid by @fulvius31 in #1680
  • Faster expected generation in test_indexing.py by @jansel in #1682
  • [cutedsl] Implement layout planning phase by @jansel in #1664
  • Revert "[pallas-tpu] fix default configs for TPU examples (#1731)" by @norx1991 in #1740
  • [pallas-tpu] Fix Pallas test_add by making non-contiguous inputs contiguous in pallas launcher. by @norx1991 in #1737
  • [Helion + torch.compile] Refactor HelionTemplateBuffer to use TemplateBuffer base class by @yf225 in #1723
  • precompile in the current process by @shunting314 in #1730
  • [CI] Fix Pyrefly lint error in template_buffer.py by @yf225 in #1746
  • [Autotuner] Add autotune_baseline_accuracy_check_fn for custom accuracy checks by @yf225 in #1733
  • Add Dockerfile by @jansel in #1748
  • Add scripts/runpod.py by @jansel in #1749
  • [cutedsl] Initial mma support by @jansel in #1742
  • Fix doubled test output in non-distributed CI jobs by @norx1991 in #1741
  • Add hl.jagged_tile by @nullplay in #1651
  • Fix logging to be compatible with pytest by @bringlein in #1734
  • Add autotune_initial_population_strategy kernel setting by @bringlein in #1735
  • [CI] Increase atol for test_squeeze_and_excitation_net_fwd on B200 by @yf225 in #1752
  • Unpin H100 nightly torch and Triton versions by @v0i0 in #1654
  • Enable pyrefly on macOS with ignore-missing-imports by @aditvenk in #1760
  • [metal] Register "metal" backend with minimal MetalBackend and launcher by @aditvenk in #1761
  • [metal] Respect force_tile_mask() in NDTileStrategy mask generation by @aditvenk in #1762
  • Remove past hackathon event from README by @choijon5 in #1780
  • Remove deadcode _clone_tree and _assert_args_close by @choijon5 in #1781
  • Fix PermutationFragment.encode() returning wrong value by @choijon5 in #1779
  • Fix atomic_max ref to return previous value by @choijon5 in #1782
  • Fix lints by @jansel in #1775
  • Add runpod SKILL.md by @jansel in #1776
  • [Helion + torch.compile] Add store/load transform hooks and prologue/epilogue fusion codegen by @yf225 in #1724
  • [Helion + torch.compile] Enable torch.compile fusion tests by @yf225 in #1727
  • [Helion + torch.compile] Simplify _remap_or_resolve for compound sympy expressions by @yf225 in #1785
  • Add scheduled workflow to rerun GPU health check failures by @v0i0 in #1683
  • Fix device sync in generic benchmarking functions for TPU/Pallas by @norx1991 in #1773
  • Removing skips and in some cases adding skipIfNotCUDA for cuda only features. by @umechand-amd in #1790
  • APIs to debug distributed kernel by @shunting314 in #1743
  • [cutedsl] tcgen05 MMA support by @jansel in #1777
  • [cutedsl] Fix broadcast and reshape-backed matmul lowering by @jansel in #1783
  • [cutedsl] Fix packed-RHS lowering and add general stack/reshape views by @jansel in #1784
  • Fix ROCm failures on main by @jansel in #1805
  • add kernel-filter to select kernel for allreduce-rmsnorm by @shunting314 in #1744
  • helion distributed kernel autotuning by @shunting314 in #1532
  • [Helion + torch.compile] Add ref baseline kernel count checks by @yf225 in #1786

Full Changelog: v0.3.2...v0.3.3