What's Changed
- Remove trailing whitespaces and add missing newlines at EOF by @trueNAHO in #530
- Fix LD_AUDIT segfaulting in certain configurations by @vosen in #533
- Fix post-merge tests by @vosen in #535
- Update devcontainer Dockerfile, bump CUDA version to 13, split cuDNN into v8 and v9 by @vosen in #536
- Regenerate LLVM tests by @vosen in #538
- Add test for `mma` by @zluda-violet in #540
- Add MIOpen and connect it to zluda_dnn by @vosen in #539
- Add mma x2 test by @zluda-violet in #547
- Improve performance of instruction_mode_to_global_mode compiler pass by @vosen in #544
- Fix regression in instruction_mode_to_global_mode by @vosen in #552
- Add more missing BLAS by @vosen in #553
- Implement functionality required by katago by @vosen in #541
- Fix LLVM tests failing because of a variable order of declarations by @vosen in #557
- Fix bugs exposed by rust-cuda by @vosen in #563
- Small fix for Rust-CUDA support by @stevefan1999-personal in #560
- Build and distribute LLVM by @zluda-violet in #555
- Use different HiGHS solver (fixes crashes in some ptx modules) by @vosen in #565
- Emit llvm.zluda.mma intrinsic for MMA by @zluda-violet in #546
- Improve quality of instruction_mode_to_global_mode pass by @vosen in #567
- Implement more low precision ML instructions by @vosen in #554
- Update LLVM output by @zluda-violet in #568
- Implement fallback MMA for RDNA1 and RDNA2 by @vosen in #551
- Various LLVM fixes and improvements by @vosen in #570
- Try to pick a more appropriate ptx from fatbin by @vosen in #569
- Add a precompilation tool by @vosen in #558
- Support ld.global.v8 by @zluda-violet in #572
- Support st.global.v8 by @zluda-violet in #573
- Improve Windows loader (zluda.exe) by @vosen in #550
- Add parser support for `.noreturn` by @zluda-violet in #575
- Add `createpolicy.fractional` as nop by @zluda-violet in #576
- Add error message for PtxError::Todo to make debugging easier by @zluda-violet in #577
- Additional NVML functionality by @zluda-violet in #574
- Use LLVM for optimized i8 MMAs by @vosen in #571
- Enable ROCm7 support on Linux and Windows by @vosen in #579
- Fix bug where ignored directives are treated as invalid by @zluda-violet in #578
- Allow implicit conversion from vec to bit scalar for ld by @zluda-violet in #580
- Swap type and state space in variable printing for pass tests by @zluda-violet in #581
- [CLEANUP] Rename refactored modules, update test golden files by @zluda-violet in #582
- Increase sccache max frame length (fixes Windows builds) by @zluda-violet in #593
- Implement bmsk.clamp.b32 by @zluda-violet in #590
- Update SCCACHE_MAX_FRAME_LENGTH for post-merge builds by @vosen in #594
- Implement cuMemHostGetDevicePointer_v2 by @Knogle in #595
- Fix conflict for initial rounding and denormal mode for non-kernel functions by @zluda-violet in #596
- Implicitly convert constants from float to bit type by @zluda-violet in #583
- Allow implicit conversion from bit scalar to vec for st by @zluda-violet in #585
- Correctly zero-initialize globals by @vosen in #588
- When building ZLUDA in CI, make sure we build Linux binaries compatible with both ROCm 6 and ROCm 7 by @vosen in #589
- Add CUDA 13.1 compatibility by @vosen in #599
- Stop failing on bf16 uint_to_fp on amdgpu < gfx11 by @vosen in #601
- Update docs (add llama.cpp, zluda_precompile sections) by @vosen in #602
- Use partial parsing result in release mode by @vosen in #603
- Add sad and dp2a instructions by @vosen in #605
- Host functions for vLLM by @zluda-violet in #606
- Implement extended precision integer addition by @zluda-violet in #607
- ctlz bug fix by @zluda-violet in #610
- Fix loading CUDA modules from dark api and add a tool to verify Windows library loading by @vosen in #612
- Finish extended precision arithmetic by @zluda-violet in #609
- Enable handle from cublasCreate to be used in cublasLt calls by @zluda-violet in #587
- Support cuMemAllocPitch_v2 and cuMemcpy2D_v2 by @vosen in #616
- Add various bits and pieces required by pytorch by @vosen in #615
- Support some cublaslt settings required by COEIROINK by @vosen in #619
- Add minimal cuSPARSE by @vosen in #621
- PyTorch fixes and improvements by @vosen in #620
- Initial textures support by @vosen in #625
- Add more cuSPARSE functions by @vosen in #624
- Support vshr.u32.u32.u32.clamp.add by @vosen in #629
- Refactor emit_brev to use emit_intrinsic helper by @hemangjoshi37a in #631
- Update tests by @vosen in #632
- Fix typo: vec_acccess -> vector_read in emit_vector_read by @hemangjoshi37a in #633
- Remove redundant map_err(CompilerError::from) calls in compiler by @hemangjoshi37a in #635
- 32 bit support in the compiler by @vosen in #637
- Fix typo: compatiblity -> compatibility in xtask comment by @hemangjoshi37a in #639
- Fix typo: overriden -> overridden in zluda_redirect by @hemangjoshi37a in #641
- Implement match.any.sync, fix popc.b64 by @vosen in #642
- Fix clz.b64 and add bfind.shiftamt by @vosen in #644
- Add cuDeviceGetPCIBusId by @vosen in #645
- Minor improvements for PyTorch by @vosen in #643
- Update compiler to ROCm 7.2 and make some minor compiler fixes by @vosen in #649
- More minor compiler improvements by @vosen in #650
New Contributors
- @trueNAHO made their first contribution in #530
- @stevefan1999-personal made their first contribution in #560
- @Knogle made their first contribution in #595
- @hemangjoshi37a made their first contribution in #631
Full Changelog: v5...v6