v0.1.8
What's Changed
- kernel: use f32 storage & expf for better precision by @DefTruth in #162
- feat: support attn-mask for triton fwd/bwd by @DefTruth in #163
- feat: support dropout_p for triton fwd/bwd by @DefTruth in #164
- feat: support flash decode for triton bwd Nq=1 by @DefTruth in #165
- kernel: fix philox state for dropout by @DefTruth in #166
- chore: fix broken docs by @DefTruth in #167
- chore: fix docs typo by @DefTruth in #168
- CUDA: clean up CUDA kernel codebase by @DefTruth in #169
- kernel: optimize the precision for attn_mask by @DefTruth in #170
- kernel: clean up triton bwd v1 kernel by @DefTruth in #171
- chore: add perf script by @DefTruth in #172
- autotune: fine-grained bucket size for max tune by @DefTruth in #173
- kernel: fix tl.dot dtype for bwd by @DefTruth in #174
- kernel: better load/compute overlap for bwd by @DefTruth in #175
- kernel: optional v grad fp32 buffer for better precision by @DefTruth in #176
- kernel: remove unused num_k_blocks by @DefTruth in #177
- [1/N] autotune: support persistent tuned configs by @DefTruth in #178
- [2/N] autotune: support persistent tuned configs by @DefTruth in #179
- perf: add tflops perf plots by @DefTruth in #180
- [2/N] perf: add tflops perf plots by @DefTruth in #181
Full Changelog: v0.1.7...v0.1.8