Release v0.1.11 · tile-ai/tilelang

What's Changed

Fix atomic_load access_ptr lowering for dynamic indices by @VitalyAnkh in #2157
[Example] Add CLC-pipelined 2-CTA GEMM example for sm100 by @ighoshsubho in #2169
[Feature] Add thread_extent parameter to T.tma_copy for flexible TMA copy by @Rachmanino in #2205
Optimize disk cache source loading by @sepcnt in #2176
[CuTeDSL] Lower handle_add_byte_offset in Python codegen by @JayceSu98 in #2261
[FFI][Host] Refactor packed API binder to use FFI asserts by @LeiWang1999 in #2263
[TileOP] Add scan operators by @LeiWang1999 in #2262
[Feature] Add CUDA __ffs intrinsic for bit manipulation by @Rachmanino in #2264
[Bugfix] Fix cached source restore and Metal codegen fallback by @LeiWang1999 in #2266
[CuTeDSL] Represent tfloat32 storage as Float32 by @JayceSu98 in #2268
[Feature] Support named barrier arrive by @Rachmanino in #2194
[BugFix][Examples] Align grouped GEMM backward runner arguments by @JayceSu98 in #2275
[CUDA][Reduce] Fix packed mixed-dtype reduce casts by @LeiWang1999 in #2276
[Pipeline] Refactor software pipeline transforms by @LeiWang1999 in #2245
[Transform] Rewrite MergeSharedMemoryAllocations with per-epoch liveness by @TensorGlue-IEIT in #2185
[Windows] Gate libtvm compatibility symlinks to Unix by @LeiWang1999 in #2273
[TIR][Transform] Revert per-epoch shared memory liveness by @LeiWang1999 in #2281
[Transform][Pipeline] Keep pointer binds out of replayable scalar inlining by @LeiWang1999 in #2278
[BugFix][CUDA] Lower FP32 MMA operands as TF32 by @JayceSu98 in #2280
[Fix] Remove "stop on other gen" heuristic in kill-point reorder by @Rachmanino in #2204
Fix HIP intrinsic rules registered on tir.* instead of tirx.* by @kashif in #2282
[TIR][Transform] Handle ragged SIMT copy partitioning by @LeiWang1999 in #2285
[Feature] Add float4_e2m1_unpacked dtype by @Rachmanino in #2271
[Backend] Refactor Transform Pipeline to support different backends by @SiriusNEO in #2189
[FIX] pass enable_2cta to ptx_tcgen05_mma_ts in tcgen05 macro generator by @ighoshsubho in #2287
[Transform] Prefer full-thread loop partitioning by @LeiWang1999 in #2288
[TIR][Transform] Fix shared.dyn alias sync analysis by @LeiWang1999 in #2293
[Backend] Promote PassPipeline to backend sub-folder and cleanup Metal Leftover by @SiriusNEO in #2291
[TIR][Transform] Fix ragged SIMT loop partitioning by @LeiWang1999 in #2296
[Reduce][Codegen] Guard packed local reduce ramp loads by @LeiWang1999 in #2298
[Feature] Add stochastic rounding cast for f32 -> fp8/fp4 on CUDA by @LJC00118 in #2260
[BugFix][CuTeDSL] Support TileKernels backend cases by @JayceSu98 in #2289
[BugFix][Transform] Deduplicate DeclBuffer names after loop unrolling by @zhouyangye1076 in #2290
[Transform] Preserve ragged parallel padding guards by @LeiWang1999 in #2299
[Transform] Reduce ragged SIMT copy padding by @LeiWang1999 in #2302
[CUDA] Support preferred copy instruction lowering by @LeiWang1999 in #2303
[Feature] Add read option to TMA store wait by @Rachmanino in #2300
Scalarize vectorized math intrinsics on HIP by @kashif in #2286
[BugFix][Examples] Use tirx in CDNA4 MXFP4 example by @ShigureNyako in #2310
[Backend][Transform] Move backend-specific transforms into separate namespaces by @SiriusNEO in #2297
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #2317
[Runtime][Cache] Make tmp dir default follow cache dir by @LeiWang1999 in #2321
[AMD][RDNA4] Fix gfx12 (RDNA 4 / Wave32) related CI issues by @zhangnju in #2313
Fix eager AST handling for *args and **kwargs by @L1ngYi in #2330
[Transform] Place auto WS producers in first warp group by @LeiWang1999 in #2315
[BugFix][Metal] Fix buffer indexing for pipeline-expanded shared memory by @harelhuang in #2325
Remove unused 'customized_code' from the exported symbols in IRBuilder by @erhsh in #2333
[Refactor] Refactor blockscaled TCGEN5, support .f8f6f4/.mxf8f6f4 and restore maint scripts by @Rachmanino in #2274
[BugFix] Fix eager JIT sub-byte shape binding by @Rachmanino in #2334
[TIR][Transform] Partition parallel loops with fragment access by @LeiWang1999 in #2340
[TIR][Transform] Warn on local var reads in assume by @LeiWang1999 in #2341
[Transform] Validate fragment write owner compatibility by @LeiWang1999 in #2343
[BugFix] Reject T.alloc_barrier() on pre-Hopper targets with a clear error by @Hughshine in #2345
[Transform] Respect fragment write owner layouts by @LeiWang1999 in #2349
[CI]: Bump pypa/cibuildwheel from 3.4 to 4.0 by @dependabot[bot] in #2355
[Release] Bump version to 0.1.11 by @LeiWang1999 in #2354

New Contributors

@TensorGlue-IEIT made their first contribution in #2185
@kashif made their first contribution in #2282
@zhouyangye1076 made their first contribution in #2290
@ShigureNyako made their first contribution in #2310
@L1ngYi made their first contribution in #2330
@harelhuang made their first contribution in #2325
@erhsh made their first contribution in #2333
@Hughshine made their first contribution in #2345

Full Changelog: v0.1.10...v0.1.11
