v0.1.8

@oulgen released this 15 Oct 00:37
· 1255 commits to main since this release
b77301f

What's Changed

  • fix rmsnorm fwd tritonbench by @v0i0 in #840
  • Update input shapes for example kernels by @yf225 in #845
  • Extend eviction policy tests to all indexing types by @oulgen in #833
  • [Docs] Remove early development warning by @oulgen in #846
  • [Docs] Add link to gpumode discord by @oulgen in #847
  • [Docs] Add PTC promotional material by @oulgen in #848
  • [Benchmark] Add low mem dropout example by @karthickai in #641
  • Update lint.yml by @oulgen in #854
  • Remove hl.register_reduction_dim API by @yf225 in #834
  • Error message for boolean masking or torch.nonzero by @yf225 in #687
  • Remove hardcoded block_size=1 usage in attention kernel example by @yf225 in #843
  • Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
  • Decrease num_stages default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
  • Allow user-defined specialization key by @jansel in #853
  • [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
  • Remove legacy register_inductor_lowering code by @yf225 in #864
  • Set setstate/getstate methods to Config by @jansel in #868
  • [doc] Add deployment/autotuning guide by @jansel in #869
  • [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
  • Fix sphinx warnings by @jansel in #871
  • Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
  • [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
  • [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
  • [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
  • Print Triton code when error for easier debugging by @yf225 in #874
  • Terminate autotuning faster if progress is minimal by @oulgen in #855
  • Update README.md by @oulgen in #877
  • [CI] pin b200 to pytorch2.9 by @oulgen in #878
  • [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
  • [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
  • Install git for benchmarks by @oulgen in #882
  • Pin AMD to 6.4.4 by @oulgen in #883
  • Faster int4 gemm by @PaulZhang12 in #751
  • Pin AMD to 6.4.4 by @oulgen in #881
  • Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
  • [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
  • [Benchmark] Use bespoke setup-python action by @oulgen in #885
  • [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
  • Add dependabot by @oulgen in #888
  • Update dependabot.yml by @oulgen in #891
  • chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
  • chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
  • chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
  • chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
  • Upgrade ruff==0.14.0 by @jansel in #889
  • [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
  • chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
  • [Benchmark] use logger.exception for process errors by @oulgen in #902
  • [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
  • Query minimum dot size for XPU by @EikanWang in #900
  • Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
  • [CI] Pin amd to rocm7.0 by @oulgen in #907
  • [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
  • [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
  • [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
  • Remove cache around set_triton_allocator by @oulgen in #912
  • Add int4_gemm by @oulgen in #917
  • chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
  • Catch missing cudnn error by @jansel in #873
  • Add progress bar for precompiling by @jansel in #919
  • Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
  • Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
  • Avoid setting default --input-sample-mode to equally-spaced-k by @yf225 in #922
  • Remove triton_helpers.* usage in lifted device function arguments by @yf225 in #849
  • Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
  • Suggest use of @helion.kernel(index_dtype=torch.int64) if index offset is out of bound for int32 by @yf225 in #850
  • Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
  • Support hl.arange() with non-power-of-2 input by @yf225 in #862
  • Setting up RunLLm AI Chatbot by @sekyondaMeta in #925
  • Generalize examples with the DEVICE variable by @adam-smnk in #915
  • Fix lint error by @jansel in #926
  • Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
  • Support tile+offset and tensor descriptors by @jansel in #928
  • Fix triton/torch.compile compatibility issue by @jansel in #927
  • Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
  • Update the Agent ID by @sekyondaMeta in #931
  • [Benchmark CI] Use --non-square flag for gemm by @yf225 in #938

Full Changelog: v0.1.7...v0.1.8