Release v0.1.8 · pytorch/helion

What's Changed

fix rmsnorm fwd tritonbench by @v0i0 in #840
Update input shapes for example kernels by @yf225 in #845
Extend eviction policy tests to all indexing types by @oulgen in #833
[Docs] Remove early development warning by @oulgen in #846
[Docs] Add link to gpumode discord by @oulgen in #847
[Docs] Add PTC promotional material by @oulgen in #848
[Benchmark] Add low mem dropout example by @karthickai in #641
Update lint.yml by @oulgen in #854
Remove hl.register_reduction_dim API by @yf225 in #834
Error message for boolean masking or torch.nonzero by @yf225 in #687
Remove hardcoded block_size=1 usage in attention kernel example by @yf225 in #843
Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
Decrease num_stages default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
Allow user-defined specialization key by @jansel in #853
[Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
Remove legacy register_inductor_lowering code by @yf225 in #864
Set setstate/getstate methods to Config by @jansel in #868
[doc] Add deployment/autotuning guide by @jansel in #869
[Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
Fix sphinx warnings by @jansel in #871
Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
[CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
[Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
[Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
Print Triton code when error for easier debugging by @yf225 in #874
Terminate autotuning faster if progress is minimal by @oulgen in #855
Update README.md by @oulgen in #877
[CI] pin b200 to pytorch2.9 by @oulgen in #878
[Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
[Benchmark] bf16 x int16 helion kernel by @karthickai in #794
Install git for benchmarks by @oulgen in #882
Pin AMD to 6.4.4 by @oulgen in #883
Faster int4 gemm by @PaulZhang12 in #751
Pin AMD to 6.4.4 by @oulgen in #881
Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
[Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
[Benchmark] Use bespoke setup-python action by @oulgen in #885
[Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
Add dependabot by @oulgen in #888
Update dependabot.yml by @oulgen in #891
chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
Upgrade ruff==0.14.0 by @jansel in #889
[Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
[Benchmark] use logger.exception for process errors by @oulgen in #902
[Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
Query minimum dot size for XPU by @EikanWang in #900
Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
[CI] Pin amd to rocm7.0 by @oulgen in #907
[Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
[Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
[Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
Remove cache around set_triton_allocator by @oulgen in #912
Add int4_gemm by @oulgen in #917
chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
Catch missing cudnn error by @jansel in #873
Add progress bar for precompiling by @jansel in #919
Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
Avoid setting default --input-sample-mode to equally-spaced-k by @yf225 in #922
Remove triton_helpers.* usage in lifted device function arguments by @yf225 in #849
Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
Suggest use of @helion.kernel(index_dtype=torch.int64) if index offset is out of bound for int32 by @yf225 in #850
Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
Support hl.arange() with non-power-of-2 input by @yf225 in #862
Setting up RunLLM AI Chatbot by @sekyondaMeta in #925
Generalize examples with the DEVICE variable by @adam-smnk in #915
Fix lint error by @jansel in #926
Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
Support tile+offset and tensor descriptors by @jansel in #928
Fix triton/torch.compile compatibility issue by @jansel in #927
Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
Update the Agent ID by @sekyondaMeta in #931
[Benchmark CI] Use --non-square flag for gemm by @yf225 in #938

New Contributors

@dependabot[bot] made their first contribution in #893
@tianrengao made their first contribution in #748

Full Changelog: v0.1.7...v0.1.8
