v0.3.3
What's Changed
- bump cublas for b200 benchmarks by @v0i0 in #1666
- Remove need to pass `device=...` into torch functions by @hinriksnaer in #1657
- [docs] Updated docs to communicate fast math support by @hinriksnaer in #1669
- Fix #762: Replace internal assertion with user-facing error in set_pid by @tianrengao in #1670
- [Helion + torch.compile] Add expected_num_kernels validation to torch.compile tests by @yf225 in #1655
- [pallas-tpu] Fix segfault xfails: softmax tests now pass, reclassify others by @v0i0 in #1721
- Reject data-dependent output shapes in infer_output_spec by @gmagogsfm in #1722
- [pallas-tpu] Add test_long_sum_manual to verify range bound fix by @v0i0 in #1732
- [pallas-tpu] fix default configs for TPU examples by @v0i0 in #1731
- Move Triton-specific implementations from Backend base class to TritonBackend. by @norx1991 in #1728
- Fix lint in ref_mode.py by @jansel in #1681
- Add missing rebenchmark, finishing phase, and effort profile wiring to DESurrogateHybrid by @fulvius31 in #1680
- Faster expected generation in test_indexing.py by @jansel in #1682
- [cutedsl] Implement layout planning phase by @jansel in #1664
- Revert "[pallas-tpu] fix default configs for TPU examples (#1731)" by @norx1991 in #1740
- [pallas-tpu] Fix Pallas test_add by making non-contiguous inputs contiguous in pallas launcher. by @norx1991 in #1737
- [Helion + torch.compile] Refactor HelionTemplateBuffer to use TemplateBuffer base class by @yf225 in #1723
- precompile in the current process by @shunting314 in #1730
- [CI] Fix Pyrefly lint error in template_buffer.py by @yf225 in #1746
- [Autotuner] Add `autotune_baseline_accuracy_check_fn` for custom accuracy checks by @yf225 in #1733
- Add Dockerfile by @jansel in #1748
- Add scripts/runpod.py by @jansel in #1749
- [cutedsl] Initial mma support by @jansel in #1742
- Fix doubled test output in non-distributed CI jobs by @norx1991 in #1741
- Add hl.jagged_tile by @nullplay in #1651
- Fix logging to be compatible with pytest by @bringlein in #1734
- Add autotune_initial_population_strategy kernel setting by @bringlein in #1735
- [CI] Increase atol for test_squeeze_and_excitation_net_fwd on B200 by @yf225 in #1752
- Unpin H100 nightly torch and Triton versions by @v0i0 in #1654
- Enable pyrefly on macOS with ignore-missing-imports by @aditvenk in #1760
- [metal] Register "metal" backend with minimal MetalBackend and launcher by @aditvenk in #1761
- [metal] Respect force_tile_mask() in NDTileStrategy mask generation by @aditvenk in #1762
- Remove past hackathon event from README by @choijon5 in #1780
- Remove deadcode _clone_tree and _assert_args_close by @choijon5 in #1781
- Fix PermutationFragment.encode() returning wrong value by @choijon5 in #1779
- Fix atomic_max ref to return previous value by @choijon5 in #1782
- Fix lints by @jansel in #1775
- Add runpod SKILL.md by @jansel in #1776
- [Helion + torch.compile] Add store/load transform hooks and prologue/epilogue fusion codegen by @yf225 in #1724
- [Helion + torch.compile] Enable torch.compile fusion tests by @yf225 in #1727
- [Helion + torch.compile] Simplify _remap_or_resolve for compound sympy expressions by @yf225 in #1785
- Add scheduled workflow to rerun GPU health check failures by @v0i0 in #1683
- Fix device sync in generic benchmarking functions for TPU/Pallas by @norx1991 in #1773
- Removing skips and in some cases adding skipIfNotCUDA for cuda only features. by @umechand-amd in #1790
- APIs to debug distributed kernel by @shunting314 in #1743
- [cutedsl] tcgen05 MMA support by @jansel in #1777
- [cutedsl] Fix broadcast and reshape-backed matmul lowering by @jansel in #1783
- [cutedsl] Fix packed-RHS lowering and add general stack/reshape views by @jansel in #1784
- Fix ROCm failures on main by @jansel in #1805
- add kernel-filter to select kernel for allreduce-rmsnorm by @shunting314 in #1744
- helion distributed kernel autotuning by @shunting314 in #1532
- [Helion + torch.compile] Add ref baseline kernel count checks by @yf225 in #1786
Full Changelog: v0.3.2...v0.3.3