v0.1.8
What's Changed
- fix rmsnorm fwd tritonbench by @v0i0 in #840
- Update input shapes for example kernels by @yf225 in #845
- Extend eviction policy tests to all indexing types by @oulgen in #833
- [Docs] Remove early development warning by @oulgen in #846
- [Docs] Add link to gpumode discord by @oulgen in #847
- [Docs] Add PTC promotional material by @oulgen in #848
- [Benchmark] Add low mem dropout example by @karthickai in #641
- Update lint.yml by @oulgen in #854
- Remove `hl.register_reduction_dim` API by @yf225 in #834
- Error message for boolean masking or torch.nonzero by @yf225 in #687
- Remove hardcoded `block_size=1` usage in attention kernel example by @yf225 in #843
- Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
- Decrease `num_stages` default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
- Allow user-defined specialization key by @jansel in #853
- [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
- Remove legacy `register_inductor_lowering` code by @yf225 in #864
- Set setstate/getstate methods to Config by @jansel in #868
- [doc] Add deployment/autotuning guide by @jansel in #869
- [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
- Fix sphinx warnings by @jansel in #871
- Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
- [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
- [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
- [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
- Print Triton code when error for easier debugging by @yf225 in #874
- Terminate autotuning faster if progress is minimal by @oulgen in #855
- Update README.md by @oulgen in #877
- [CI] pin b200 to pytorch2.9 by @oulgen in #878
- [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
- [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
- Install git for benchmarks by @oulgen in #882
- Pin AMD to 6.4.4 by @oulgen in #883
- Faster int4 gemm by @PaulZhang12 in #751
- Pin AMD to 6.4.4 by @oulgen in #881
- Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
- [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
- [Benchmark] Use bespoke setup-python action by @oulgen in #885
- [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
- Add dependabot by @oulgen in #888
- Update dependabot.yml by @oulgen in #891
- chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
- chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
- chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
- chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
- Upgrade ruff==0.14.0 by @jansel in #889
- [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
- chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
- [Benchmark] use logger.exception for process errors by @oulgen in #902
- [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
- Query minimum dot size for XPU by @EikanWang in #900
- Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
- [CI] Pin amd to rocm7.0 by @oulgen in #907
- [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
- [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
- [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
- Remove cache around set_triton_allocator by @oulgen in #912
- Add int4_gemm by @oulgen in #917
- chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
- Catch missing cudnn error by @jansel in #873
- Add progress bar for precompiling by @jansel in #919
- Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
- Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
- Avoid setting default `--input-sample-mode` to `equally-spaced-k` by @yf225 in #922
- Remove `triton_helpers.*` usage in lifted device function arguments by @yf225 in #849
- Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
- Suggest use of `@helion.kernel(index_dtype=torch.int64)` if index offset is out of bound for int32 by @yf225 in #850
- Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
- Support `hl.arange()` with non-power-of-2 input by @yf225 in #862
- Setting up RunLLm AI Chatbot by @sekyondaMeta in #925
- Generalize examples with the DEVICE variable by @adam-smnk in #915
- Fix lint error by @jansel in #926
- Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
- Support tile+offset and tensor descriptors by @jansel in #928
- Fix triton/torch.compile compatibility issue by @jansel in #927
- Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
- Update the Agent ID by @sekyondaMeta in #931
- [Benchmark CI] Use `--non-square` flag for gemm by @yf225 in #938
New Contributors
- @dependabot[bot] made their first contribution in #893
- @tianrengao made their first contribution in #748
Full Changelog: v0.1.7...v0.1.8