v1.3.6
What's Changed
- chore: update cache-dit arch by @DefTruth in #932
- bc: deprecated serving module by @DefTruth in #933
- chore: suppress torch compile tuning logs by @DefTruth in #934
- compile: enabled descent_tuning by default by @DefTruth in #935
- docs: update quantization docs by @DefTruth in #937
- quant: add quantize backend enum by @DefTruth in #938
- kernel: refactor ops register by @DefTruth in #939
- chore: fix vllm-omni docs links by @DefTruth in #940
- examples: add cuda graph option by @DefTruth in #942
- chore: fix utils log info by @DefTruth in #943
- chore: add cuda graph usage docs by @DefTruth in #944
- chore: add cuda graph usage to overview by @DefTruth in #945
- CLI: add compile full-graph option by @DefTruth in #946
- chore: fix fullgraph param typo by @DefTruth in #947
- docs: add more cuda graph perf results by @DefTruth in #948
- docs: add more cuda graph perf results by @DefTruth in #949
- docs: update cuda graphs docs by @DefTruth in #950
- chore: allow cuda graph for dynamic compile by @DefTruth in #951
- feat: support cuda graph + fp8 rowwise by @DefTruth in #952
- chore: hotfix for broken mkdocs by @DefTruth in #953
- [1/N] feat: support svdquant w4a4 - kernels & skills by @DefTruth in #954
- pytest: fast_svd mode for testing by @DefTruth in #955
- [2/N] feat: streaming quantize for svdquant by @DefTruth in #956
- [3/N] feat: PTQ workflow for svdquant by @DefTruth in #957
- SKILL: add ptq-workflow-integration skill by @DefTruth in #958
- pytest: separate kernels and quantization tests by @DefTruth in #959
- chore: add docstrings to codebase by @DefTruth in #960
- chore: add svdq e2e example and format code by @DefTruth in #961
- SKILL: add Cute-DSL/CUDA/CUTLASS skills by @DefTruth in #962
- chore: update docs by @DefTruth in #963
- kernel: tune svdq w4a4 gemm stage/blk size for Ada by @DefTruth in #966
- kernel: unified ops register policy by @DefTruth in #967
- bench: refactor cache-dit bench by @DefTruth in #968
- svdquant: fast svd decompose, ~18x speedup by @DefTruth in #969
- [2/N] tune svdq w4a4 gemm for ada by @DefTruth in #970
- bc: refactor distributed codebase by @DefTruth in #971
- kernel: add cute-dsl based merge-attn-states kernel by @DefTruth in #973
- feat: extend SVDQ PTQ -> SVDQ DQ by @DefTruth in #974
- fix: support 3D input/output for W4A4 linear by @DefTruth in #975
- chore: support svdq-calib option in examples by @DefTruth in #976
- kernel: add cute-dsl based fp8 comm kernels by @DefTruth in #977
- [1/N] feat: support cute-dsl based svdquant w4a4 by @DefTruth in #978
- feat: support svdq-dq few shot by @DefTruth in #979
- chore: update svdq-dq few shot docs by @DefTruth in #980
- feat: support layerwise cpu offload by @DefTruth in #981
- [2/N] feat: support layerwise offload by @DefTruth in #982
- [3/N] feat: support layerwise offload by @DefTruth in #983
- [4/N] feat: support layerwise offload by @DefTruth in #984
- chore: unified all2all/ring comm api by @DefTruth in #985
- chore: refactor async ulysses codebase by @DefTruth in #986
- remove cutedsl based svdq kernels by @DefTruth in #987
- fix tensor parallel register import error by @DefTruth in #988
- feat: support sub cp_plan for context parallel by @DefTruth in #989
- chore: fix attention dispatch comments by @DefTruth in #990
- community: add tensorrt-llm x cache-dit link by @DefTruth in #991
- deps: use uv to install deps by @DefTruth in #992
- chore: update docs by @DefTruth in #993
- chore: add layerwise offload to overview by @DefTruth in #994
- chore: update layerwise offload cli quick start by @DefTruth in #995
- attention: fix sage-attn backend dispatch by @DefTruth in #996
- chore: add exclude-layers param to ptq example by @DefTruth in #997
- attention: separate attn backends by @DefTruth in #998
- svdq: support converter cli for dq workflow by @DefTruth in #999
- chore: fix typo by @DefTruth in #1000
- chore: revise quantization example in README by @DefTruth in #1001
- chore: update README by @DefTruth in #1002
- feat: support ray wrapper by @DefTruth in #1003
Full Changelog: v1.3.5...v1.3.6