Releases · intel/xFasterTransformer
v2.1.2
v2.1.1
v2.1.0 Qwen3 Series models supported!🎉
Models
- Support Qwen3 series models (see the loading sketch below).
Performance
- Optimize DeepSeek-R1 fp8_e4m3 performance.
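A minimal loading sketch for the new Qwen3 support, assuming the checkpoint has already been converted to xFasterTransformer format with the usual convert tools; the paths below are placeholders.

```python
# Minimal sketch: load a converted Qwen3 checkpoint and generate text.
# Follows the project's standard single-rank Python example; paths are placeholders.
import xfastertransformer
from transformers import AutoTokenizer

MODEL_PATH = "/data/Qwen3-8B-xft"   # converted xFT weights (placeholder)
TOKEN_PATH = "/data/Qwen3-8B"       # original HF checkpoint for the tokenizer (placeholder)

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

input_ids = tokenizer("What is xFasterTransformer?", return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```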
v2.0.0 DeepSeek-R1 671B supported!🎉
Models
- Support DeepSeek-R1 671B with `fp8_e4m3` dtype, using `bf16` KV cache dtype (see the sketch after this list).
- Support Mixtral MoE series models.
- Support TeleChat model.
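A hedged sketch of loading DeepSeek-R1 with the data types named above; the `kv_cache_dtype` keyword and the paths are assumptions, and the multi-rank `mpirun` launch required for a 671B model is omitted.

```python
# Hedged sketch: load a converted DeepSeek-R1 checkpoint with the fp8_e4m3
# weight dtype and bf16 KV cache dtype from this release note.
# The kv_cache_dtype keyword is an assumption about the Python API,
# and the path is a placeholder.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/DeepSeek-R1-xft",   # converted xFT weights (placeholder)
    dtype="fp8_e4m3",          # weight/compute dtype from this release
    kv_cache_dtype="bf16",     # KV cache dtype from this release (assumed kwarg)
)
```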
What's Changed
Generated release notes
- Bump gradio from 4.37.2 to 5.0.0 in /examples/web_demo by @dependabot in #479
- Bump gradio from 5.0.0 to 5.5.0 in /examples/web_demo by @dependabot in #483
- [API] Add layernorm FP16 support; by @wenhuanh in #485
- Bump gradio from 5.5.0 to 5.11.0 in /examples/web_demo by @dependabot in #488
- Fix bug for EMR SNC-2 mode benchmark by @qiuyuleng1 in #484
- Fix bugs in mpirun commands by @zsym-sjtu in #487
- [web demo] Add thinking process for demo by @wenhuanh in #492
New Contributors
- @qiuyuleng1 made their first contribution in #484
- @zsym-sjtu made their first contribution in #487
Full Changelog: v1.8.2...v2.0.0
v1.8.2
v1.8.1
Functionality
- Expose the interface of embedding lookup.
Performance
- Optimized the performance of grouped query attention (GQA).
- Enhanced the performance of creating keys for the oneDNN primitive cache.
- Set the [bs][nh][seq][hs] layout as the default for KV Cache, resulting in better performance.
- Mitigated the task-split imbalance issue in self-attention.
v1.8.0 Continuous Batching on Single ARC GPU and AMX_FP16 Support.
Highlight
- Continuous batching on a single ARC GPU is supported and can be integrated via `vllm-xft` (see the client sketch after this list).
- Introduce Intel AMX instruction support for the `float16` data type.
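A hedged client-side sketch of the `vllm-xft` integration mentioned above: it assumes a vllm-xft OpenAI-compatible server is already running locally, and the served model name, port, and endpoint are placeholder values chosen at launch rather than anything stated in this release note.

```python
# Hedged sketch: query a running vllm-xft OpenAI-compatible endpoint.
# Server start-up itself is not shown; model name and URL are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="xft",                                # served model name chosen at launch (assumed)
    prompt="Tell me about Intel ARC GPUs.",
    max_tokens=64,
)
print(resp.choices[0].text)
```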
Models
- Support ChatGLM4 series models.
- Introduce BF16/FP16 full path support for Qwen series models.
BUG fix
- Fixed memory leak of oneDNN primitive cache.
- Fixed the SPR-HBM flat QUAD mode detection issue in benchmark scripts.
- Fixed the head split error for distributed grouped-query attention (GQA).
- Fixed an issue with the invokeAttentionLLaMA API.
What's Changed
Generated release notes
- [Kernel] Enable continuous batching on single GPU. by @changqi1 in #452
- [Bugfix] fixed shm reduceAdd & rope error when batch size is large by @abenmao in #457
- [Feature] Enable AMX FP16 on next generation CPU by @wenhuanh in #456
- [Kernel] Cache oneDNN primitive when M < `XFT_PRIMITIVE_CACHE_M`, default 256. by @Duyi-Wang in #460 (see the tuning sketch after this list)
- [Dependency] Pin python requirements.txt version. by @Duyi-Wang in #458
- [Dependency] Bump web_demo requirement. by @Duyi-Wang in #463
- [Layers] Enable AMX FP16 of FlashAttn by @abenmao in #459
- [Layers] Fix invokeAttentionLLaMA API by @wenhuanh in #464
- [Readme] Add accepted papers by @wenhuanh in #465
- [Kernel] Make SelfAttention prepared for AMX_FP16; More balanced task split in Cross Attention by @pujiang2018 in #466
- [Kernel] Upgrade xDNN to v1.5.2 and make AMX_FP16 work by @pujiang2018 in #468
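A small sketch of tuning the `XFT_PRIMITIVE_CACHE_M` threshold introduced in #460 (default 256 per the PR title); setting the environment variable before loading the model is an assumption about when it is read, and the path is a placeholder.

```python
# Hedged sketch: raise the oneDNN primitive cache threshold from #460.
# XFT_PRIMITIVE_CACHE_M is read from the environment (default 256);
# setting it before model load is the safe choice (assumption).
import os
os.environ["XFT_PRIMITIVE_CACHE_M"] = "512"  # cache primitives for M < 512

import xfastertransformer
model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16")
```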
Full Changelog: v1.7.3...v1.8.0
v1.7.3
v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.
Functionality
- Add continuous batching support for Qwen 1.0 models.
- Enable hybrid data types for the continuous batching feature, including `BF16_FP16`, `BF16_INT8`, `BF16_W8A8`, `BF16_INT4`, `BF16_NF4`, `W8A8_INT8`, `W8A8_int4`, `W8A8_NF4` (see the sketch after this list).
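A minimal sketch of picking one of the hybrid data types listed above, assuming a hybrid name is passed through the same `dtype` string as the plain data types; the path is a placeholder.

```python
# Hedged sketch: load a converted Qwen 1.0 checkpoint with a hybrid dtype.
# Assumes hybrid names go through the same dtype string as plain dtypes.
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/Qwen-7B-xft",   # converted xFT weights (placeholder)
    dtype="bf16_int8",     # hybrid: bf16 for the first weight format, int8 for the second
)
```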
BUG fix
- Fixed the conversion fault in Baichuan1 models.
What's Changed
Generated release notes
- [Doc] Add vllm benchmark docs. by @marvin-Yu in #448
- [Kernel] Add GPU kernels and enable LLaMA model. by @changqi1 in #372
- [Tools] Add Baichuan1/2 convert tool by @abenmao in #451
- [Layers] Add qwenRope support for Qwen1.0 in CB mode by @abenmao in #449
- [Framework] Remove duplicated code by @xiangzez in #450
- [Model] Support hybrid model in continuous batching. by @Duyi-Wang in #453
- [Version] v1.7.2. by @Duyi-Wang in #454
Full Changelog: v1.7.1...v1.7.2
v1.7.1 - Continuous batching feature supports ChatGLM2/3.
Functionality
- Add continuous batching support for ChatGLM2/3 models.
- Qwen2Convert supports Qwen2 models quantized by GPTQ, such as GPTQ-Int8 and GPTQ-Int4, via the param `from_quantized_model="gptq"` (see the sketch after this list).
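A hedged sketch of the new `from_quantized_model` parameter: the parameter name comes from this release note, while the positional input/output directories are assumed to follow the other *Convert tools, and the paths are placeholders.

```python
# Hedged sketch: convert a GPTQ-quantized Qwen2 checkpoint to xFT format.
# from_quantized_model="gptq" is from the release note; the positional
# input/output directories are an assumption, and paths are placeholders.
import xfastertransformer as xft

xft.Qwen2Convert().convert(
    "/data/Qwen2-7B-Instruct-GPTQ-Int4",   # HF GPTQ checkpoint (placeholder)
    "/data/Qwen2-7B-Instruct-xft",         # output dir for xFT weights (placeholder)
    from_quantized_model="gptq",           # tell the converter the source is GPTQ
)
```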
BUG fix
- Fixed the segmentation fault when running with more than 2 ranks in vllm-xft serving.
What's Changed
Generated release notes
- [README] Update README.md. by @Duyi-Wang in #434
- [README] Update README.md. by @Duyi-Wang in #435
- [Common]Add INT8/UINT4 to BF16 weight convert by @xiangzez in #436
- Add Continue Batching support for Chatglm2/3 by @a3213105 in #438
- [Model] Add Qwen2 GPTQ model support by @xiangzez in #439
- [Model] Fix array out of bounds when rank > 2. by @Duyi-Wang in #441
- Bump gradio from 4.19.2 to 4.36.0 in /examples/web_demo by @dependabot in #442
- [Version] v1.7.1. by @Duyi-Wang in #445
Full Changelog: v1.7.0...v1.7.1