What's Changed
- Fix atomic_load access_ptr lowering for dynamic indices by @VitalyAnkh in #2157
- [Example] Add CLC-pipelined 2-CTA GEMM example for sm100 by @ighoshsubho in #2169
- [Feature] Add thread_extent parameter to T.tma_copy for flexible TMA copy by @Rachmanino in #2205
- Optimize disk cache source loading by @sepcnt in #2176
- [CuTeDSL] Lower handle_add_byte_offset in Python codegen by @JayceSu98 in #2261
- [FFI][Host] Refactor packed API binder to use FFI asserts by @LeiWang1999 in #2263
- [TileOP] Add scan operators by @LeiWang1999 in #2262
- [Feature] Add CUDA __ffs intrinsic for bit manipulation by @Rachmanino in #2264
- [Bugfix] Fix cached source restore and Metal codegen fallback by @LeiWang1999 in #2266
- [CuTeDSL] Represent tfloat32 storage as Float32 by @JayceSu98 in #2268
- [Feature] Support named barrier arrive by @Rachmanino in #2194
- [BugFix][Examples] Align grouped GEMM backward runner arguments by @JayceSu98 in #2275
- [CUDA][Reduce] Fix packed mixed-dtype reduce casts by @LeiWang1999 in #2276
- [Pipeline] Refactor software pipeline transforms by @LeiWang1999 in #2245
- [Transform] Rewrite MergeSharedMemoryAllocations with per-epoch liveness by @TensorGlue-IEIT in #2185
- [Windows] Gate libtvm compatibility symlinks to Unix by @LeiWang1999 in #2273
- [TIR][Transform] Revert per-epoch shared memory liveness by @LeiWang1999 in #2281
- [Transform][Pipeline] Keep pointer binds out of replayable scalar inlining by @LeiWang1999 in #2278
- [BugFix][CUDA] Lower FP32 MMA operands as TF32 by @JayceSu98 in #2280
- [Fix] Remove "stop on other gen" heuristic in kill-point reorder by @Rachmanino in #2204
- Fix HIP intrinsic rules registered on tir.* instead of tirx.* by @kashif in #2282
- [TIR][Transform] Handle ragged SIMT copy partitioning by @LeiWang1999 in #2285
- [Feature] Add float4_e2m1_unpacked dtype by @Rachmanino in #2271
- [Backend] Refactor Transform Pipeline to support different backends by @SiriusNEO in #2189
- [FIX] pass enable_2cta to ptx_tcgen05_mma_ts in tcgen05 macro generator by @ighoshsubho in #2287
- [Transform] Prefer full-thread loop partitioning by @LeiWang1999 in #2288
- [TIR][Transform] Fix shared.dyn alias sync analysis by @LeiWang1999 in #2293
- [Backend] Promote PassPipeline to backend sub-folder and cleanup Metal Leftover by @SiriusNEO in #2291
- [TIR][Transform] Fix ragged SIMT loop partitioning by @LeiWang1999 in #2296
- [Reduce][Codegen] Guard packed local reduce ramp loads by @LeiWang1999 in #2298
- [Feature] Add stochastic rounding cast for f32 -> fp8/fp4 on CUDA by @LJC00118 in #2260
- [BugFix][CuTeDSL] Support TileKernels backend cases by @JayceSu98 in #2289
- [BugFix][Transform] Deduplicate DeclBuffer names after loop unrolling by @zhouyangye1076 in #2290
- [Transform] Preserve ragged parallel padding guards by @LeiWang1999 in #2299
- [Transform] Reduce ragged SIMT copy padding by @LeiWang1999 in #2302
- [CUDA] Support preferred copy instruction lowering by @LeiWang1999 in #2303
- [Feature] Add read option to TMA store wait by @Rachmanino in #2300
- Scalarize vectorized math intrinsics on HIP by @kashif in #2286
- [BugFix][Examples] Use tirx in CDNA4 MXFP4 example by @ShigureNyako in #2310
- [Backend][Transform] Move backend-specific transforms into separate namespaces by @SiriusNEO in #2297
- [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #2317
- [Runtime][Cache] Make tmp dir default follow cache dir by @LeiWang1999 in #2321
- [AMD][RDNA4] Fix gfx12 (RDNA 4 / Wave32) related CI issues by @zhangnju in #2313
- Fix eager AST handling for *args and **kwargs by @L1ngYi in #2330
- [Transform] Place auto WS producers in first warp group by @LeiWang1999 in #2315
- [BugFix][Metal] Fix buffer indexing for pipeline-expanded shared memory by @harelhuang in #2325
- Remove unused 'customized_code' from the exported symbols in IRBuilder by @erhsh in #2333
- [Refactor] Refactor blockscaled TCGEN5, support .f8f6f4/.mxf8f6f4 and restore maint scripts by @Rachmanino in #2274
- [BugFix] Fix eager JIT sub-byte shape binding by @Rachmanino in #2334
- [TIR][Transform] Partition parallel loops with fragment access by @LeiWang1999 in #2340
- [TIR][Transform] Warn on local var reads in assume by @LeiWang1999 in #2341
- [Transform] Validate fragment write owner compatibility by @LeiWang1999 in #2343
- [BugFix] Reject T.alloc_barrier() on pre-Hopper targets with a clear error by @Hughshine in #2345
- [Transform] Respect fragment write owner layouts by @LeiWang1999 in #2349
- [CI]: Bump pypa/cibuildwheel from 3.4 to 4.0 by @dependabot[bot] in #2355
- [Release] Bump version to 0.1.11 by @LeiWang1999 in #2354
New Contributors
- @TensorGlue-IEIT made their first contribution in #2185
- @kashif made their first contribution in #2282
- @zhouyangye1076 made their first contribution in #2290
- @ShigureNyako made their first contribution in #2310
- @L1ngYi made their first contribution in #2330
- @harelhuang made their first contribution in #2325
- @erhsh made their first contribution in #2333
- @Hughshine made their first contribution in #2345
Full Changelog: v0.1.10...v0.1.11