Releases · intel/xFasterTransformer
v2.1.2
v2.1.1
v2.1.0 Qwen3 Series models supported!🎉
Models
- Support Qwen3 series models (see the loading sketch below).
Performance
- Optimize DeepSeek-R1 fp8_e4m3 performance.
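A minimal loading sketch for the new Qwen3 support, assuming the checkpoint has already been converted to xFasterTransformer format with the usual convert tools; the paths below are placeholders.

```python
# Minimal sketch: load a converted Qwen3 checkpoint and generate text.
# Follows the project's standard single-rank Python example; paths are placeholders.
import xfastertransformer
from transformers import AutoTokenizer

MODEL_PATH = "/data/Qwen3-8B-xft"   # converted xFT weights (placeholder)
TOKEN_PATH = "/data/Qwen3-8B"       # original HF checkpoint for the tokenizer (placeholder)

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer("What is xFasterTransformer?", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```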
v2.0.0 DeepSeek-R1 671B supported!🎉
Models
- Support DeepSeek-R1 671B with `fp8_e4m3` dtype, using `bf16` KV cache dtype (see the sketch after this list).
- Support Mixtral MoE series models.
- Support TeleChat model.
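A hedged sketch of loading DeepSeek-R1 with the data types named above; the `kv_cache_dtype` keyword and the paths are assumptions, and the multi-rank `mpirun` launch required for a 671B model is omitted.

```python
# Hedged sketch: load a converted DeepSeek-R1 checkpoint with the fp8_e4m3
# weight dtype and bf16 KV cache dtype from this release note.
# The kv_cache_dtype keyword is an assumption about the Python API,
# and the path is a placeholder.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/DeepSeek-R1-xft",   # converted xFT weights (placeholder)
    dtype="fp8_e4m3",          # weight/compute dtype from this release
    kv_cache_dtype="bf16",     # KV cache dtype from this release (assumed kwarg)
)
```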
What's Changed
Generated release notes
- Bump gradio from 4.37.2 to 5.0.0 in /examples/web_demo by @dependabot in #479
- Bump gradio from 5.0.0 to 5.5.0 in /examples/web_demo by @dependabot in #483
- [API] Add layernorm FP16 support; by @wenhuanh in #485
- Bump gradio from 5.5.0 to 5.11.0 in /examples/web_demo by @dependabot in #488
- Fix bug for EMR SNC-2 mode benchmark by @qiuyuleng1 in #484
- Fix bugs in mpirun commands by @zsym-sjtu in #487
- [web demo] Add thinking process for demo by @wenhuanh in #492
New Contributors
- @qiuyuleng1 made their first contribution in #484
- @zsym-sjtu made their first contribution in #487
Full Changelog: v1.8.2...v2.0.0
v1.8.2
v1.8.1
Functionality
- Expose the interface of embedding lookup.
Performance
- Optimized the performance of grouped query attention (GQA).
- Enhanced the performance of creating keys for the oneDNN primitive cache.
- Set the [bs][nh][seq][hs] layout as the default for KV Cache, resulting in better performance.
- Mitigated the task-split imbalance issue in self-attention.
v1.8.0 Continuous Batching on Single ARC GPU and AMX_FP16 Support.
Highlight
- Continuous batching on a single ARC GPU is supported and can be integrated via `vllm-xft` (see the client sketch after this list).
- Introduce Intel AMX instruction support for the `float16` data type.
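A hedged client-side sketch of the `vllm-xft` integration mentioned above: it assumes a vllm-xft OpenAI-compatible server is already running locally, and the served model name, port, and endpoint are placeholder values chosen at launch rather than anything stated in this release note.

```python
# Hedged sketch: query a running vllm-xft OpenAI-compatible endpoint.
# Server start-up itself is not shown; model name and URL are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="xft",                                # served model name chosen at launch (assumed)
    prompt="Tell me about Intel ARC GPUs.",
    max_tokens=64,
)
print(resp.choices[0].text)
```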
Models
- Support ChatGLM4 series models.
- Introduce BF16/FP16 full path support for Qwen series models.
BUG fix
- Fixed memory leak of oneDNN primitive cache.
- Fixed the SPR-HBM flat QUAD mode detection issue in benchmark scripts.
- Fixed the head split error for distributed grouped-query attention (GQA).
- Fixed an issue with the invokeAttentionLLaMA API.
What's Changed
Generated release notes
- [Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
- [Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
- [Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
- [Kernel] Cache oneDNN primitive when M < `XFT_PRIMITIVE_CACHE_M`, default 256. by @Duyi-Wang in #460 (see the tuning sketch after this list)
- [Dependency] Pin python requirements.txt version. by @Duyi-Wang in #458
- [Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
- [Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
- [Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
- [Readme] Add accepted papers by @wenhuanh in #465
- [Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
- [Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468
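A small sketch of tuning the `XFT_PRIMITIVE_CACHE_M` threshold introduced in #460 (default 256 per the PR title); setting the environment variable before loading the model is an assumption about when it is read, and the path is a placeholder.

```python
# Hedged sketch: raise the oneDNN primitive cache threshold from #460.
# XFT_PRIMITIVE_CACHE_M is read from the environment (default 256);
# setting it before model load is the safe choice (assumption).
import os
os.environ["XFT_PRIMITIVE_CACHE_M"] = "512"  # cache primitives for M < 512

import xfastertransformer
model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16")
```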
Full Changelog: v1.7.3...v1.8.0
v1.7.3
v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.
Functionality
- Add continuous batching support for Qwen 1.0 models.
- Enable hybrid data types for the continuous batching feature, including `BF16_FP16`, `BF16_INT8`, `BF16_W8A8`, `BF16_INT4`, `BF16_NF4`, `W8A8_INT8`, `W8A8_int4`, `W8A8_NF4` (see the sketch after this list).
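A minimal sketch of picking one of the hybrid data types listed above, assuming a hybrid name is passed through the same `dtype` string as the plain data types; the path is a placeholder.

```python
# Hedged sketch: load a converted Qwen 1.0 checkpoint with a hybrid dtype.
# Assumes hybrid names go through the same dtype string as plain dtypes.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/Qwen-7B-xft",   # converted xFT weights (placeholder)
    dtype="bf16_int8",     # hybrid: bf16 for the first weight format, int8 for the second
)
```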
BUG fix
- Fixed the conversion fault in Baichuan1 models.
What's Changed
Generated release notes
- [Doc] Add vllm benchmark docs. by @marvin-Yu in #448
- [Kernel] Add GPU kernels and enable LLaMA model. by @changqi1 in #372
- [Tools] Add Baichuan1/2 convert tool by @abenmao in #451
- [Layers] Add qwenRope support for Qwen1.0 in CB mode by @abenmao in #449
- [Framework] Remove duplicated code by @xiangzez in #450
- [Model] Support hybrid model in continuous batching. by @Duyi-Wang in #453
- [Version] v1.7.2. by @Duyi-Wang in #454
Full Changelog: v1.7.1...v1.7.2
v1.7.1 - Continuous batching feature supports ChatGLM2/3.
Functionality
- Add continuous batching support for ChatGLM2/3 models.
- Qwen2Convert supports Qwen2 models quantized by GPTQ, such as GPTQ-Int8 and GPTQ-Int4, via the param `from_quantized_model="gptq"` (see the sketch after this list).
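A hedged sketch of the new `from_quantized_model` parameter: the parameter name comes from this release note, while the positional input/output directories are assumed to follow the other *Convert tools, and the paths are placeholders.

```python
# Hedged sketch: convert a GPTQ-quantized Qwen2 checkpoint to xFT format.
# from_quantized_model="gptq" is from the release note; the positional
# input/output directories are an assumption, and paths are placeholders.
import xfastertransformer as xft

xft.Qwen2Convert().convert(
    "/data/Qwen2-7B-Instruct-GPTQ-Int4",   # HF GPTQ checkpoint (placeholder)
    "/data/Qwen2-7B-Instruct-xft",         # output dir for xFT weights (placeholder)
    from_quantized_model="gptq",           # tell the converter the source is GPTQ
)
```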
BUG fix
- Fixed the segmentation fault when running with more than 2 ranks in vllm-xft serving.
What's Changed
Generated release notes
- [README] Update README.md. by @Duyi-Wang in #434
- [README] Update README.md. by @Duyi-Wang in #435
- [Common]Add INT8/UINT4 to BF16 weight convert by @xiangzez in #436
- Add Continue Batching support for Chatglm2/3 by @a3213105 in #438
- [Model] Add Qwen2 GPTQ model support by @xiangzez in #439
- [Model] Fix array out of bounds when rank > 2. by @Duyi-Wang in #441
- Bump gradio from 4.19.2 to 4.36.0 in /examples/web_demo by @dependabot in #442
- [Version] v1.7.1. by @Duyi-Wang in #445
Full Changelog: v1.7.0...v1.7.1