Tags: deepspeedai/DeepSpeed

v0.16.4

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Fix: remove duplicate loop in bf16 optimizer (#7054)

Refreshing optimizer state from a bf16 checkpoint when using bf16 with MoE raised
`IndexError: list index out of range`.
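A minimal illustration of the failure mode described above (hypothetical code, not DeepSpeed's actual optimizer): when a loop over parameters is duplicated, the shared index keeps advancing past the end of the saved-state list during a checkpoint refresh.

```python
def refresh_state_buggy(params, saved_states):
    """Duplicated loop: the index advances past len(saved_states)."""
    restored = []
    i = 0
    for p in params:
        restored.append((p, saved_states[i]))
        i += 1
    for p in params:                           # duplicate pass over the same params
        restored.append((p, saved_states[i]))  # IndexError once i == len(saved_states)
        i += 1
    return restored

def refresh_state_fixed(params, saved_states):
    """Single pass: exactly one saved state per parameter."""
    return [(p, s) for p, s in zip(params, saved_states)]
```

Removing the duplicate pass makes one state pair per parameter, which is what a refresh should produce.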

Signed-off-by: shaomin <wukon1992@gmail.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>

v0.16.3

Using explicit GPU upcast for ZeRO-Offload (#6962)

Following the discussion in
[PR-6670](#6670): the explicit
upcast is much more efficient than the implicit one, so this PR replaces
the implicit upcast with an explicit one.

The results on a 3B model are shown below:

| Option         | BWD (ms) | Speedup |
|----------------|----------|---------|
| Before PR-6670 | 25603.30 | 1.0x    |
| After PR-6670  | 1174.31  | 21.8x   |
| After this PR  | 309.20   | 82.8x   |
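The gain comes from hoisting the dtype conversion out of the hot path. A rough sketch of the pattern in plain Python (bf16 values emulated by truncating fp32 bits; none of this is the actual DeepSpeed code):

```python
import struct

def fp32_to_bf16_bits(x):
    # bf16 keeps the top 16 bits of an IEEE-754 fp32 value
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b):
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

def accumulate_implicit(master, grads_bf16, steps):
    # Implicit upcast: every element is converted on every use,
    # inside the hot loop -- analogous to mixed-dtype kernel arithmetic.
    for _ in range(steps):
        for i, g in enumerate(grads_bf16):
            master[i] += bf16_bits_to_fp32(g)

def accumulate_explicit(master, grads_bf16, steps):
    # Explicit upcast: convert the whole buffer to fp32 once,
    # then run every subsequent step entirely in fp32.
    grads_fp32 = [bf16_bits_to_fp32(g) for g in grads_bf16]
    for _ in range(steps):
        for i, g in enumerate(grads_fp32):
            master[i] += g
```

On a real GPU the analogue is a single bulk cast of the tensor to fp32 before the accumulation, instead of letting each kernel promote operands on the fly.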

v0.16.2

Update code owners (#6890)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

v0.16.1

Pin transformers version in cpu-torch-latest due to multiprocessing error. (#6823)

This is a copy of #6820 for
the cpu-torch-latest tests.

This PR reverts/fixes #6822.

v0.16.0

Revert release workflow (#6785)

v0.15.4

Switch which versions of Python are supported (#5676)

Add support for testing compilation with Python 3.11/3.12.

Also add the Dockerfiles used to build those images.

---------

Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>

v0.15.3

[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645)

We found that #6592 uses `_pinned_tensor_mgr` to create the CPU bounce
buffer, which is the same approach our XPU accelerator currently takes,
so there is no need for an XPU-specific `cpu_op_desc_t`.
In this PR:
1. Remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp.
2. Modify the XPU async_io op builder.

This cannot be done by simply reverting #6532, because we added some
source files when the GDS feature last went into DeepSpeed. Hence this new PR :)

v0.15.2

Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)

Modified `_replace_module` in auto_tp.py:
the modification keeps the layers 'shared_expert_gate' and 'gate' in
Qwen2-MoE as their original type, torch.nn.Linear, instead of changing
them into LinearLayer. This way their weights are not split across
multiple HPU/GPU cards, and Qwen2-MoE can run on multiple cards.
Since the weights of 'gate' are not split across cards, no all-gather
operations are needed, which may improve performance.
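The exclusion logic can be sketched as follows (illustrative names, not the actual `_replace_module` code): a small keep-list short-circuits the replacement, so gate weights stay replicated on every card.

```python
# Modules whose leaf name is listed here keep their original type
# (torch.nn.Linear in the real code) instead of being sharded.
KEEP_ORIGINAL = {"gate", "shared_expert_gate"}

def maybe_replace(module_name, module, to_linear_layer):
    """Return the module unchanged if it must stay replicated, else shard it."""
    leaf = module_name.rsplit(".", 1)[-1]
    if leaf in KEEP_ORIGINAL:
        return module               # replicated: no split, no all-gather needed
    return to_linear_layer(module)  # everything else gets tensor-parallel sharding
```

Because the router's gate output is needed in full on every card, keeping it replicated avoids the all-gather that a sharded gate would require.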

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

v0.15.1

Handle an edge case where `CUDA_HOME` is not defined on ROCm systems (#6488)

* Handles an edge case when building `gds` where `CUDA_HOME` is not
defined on ROCm systems.
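The guard can be sketched like this (hypothetical helper, not the actual op-builder code): on ROCm, fall back to a ROCm install path instead of assuming `CUDA_HOME` exists.

```python
import os

def resolve_toolkit_home(is_rocm):
    """Return the toolkit root, tolerating a missing CUDA_HOME on ROCm."""
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home is None and is_rocm:
        # ROCm machines often have no CUDA install at all; fall back to
        # ROCM_PATH (or its conventional default) rather than crashing.
        return os.environ.get("ROCM_PATH", "/opt/rocm")
    return cuda_home
```

The key point is that a `None` result on a ROCm box is expected, not an error, so the build path must not index into it unconditionally.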

v0.15.0

Fix torch check (#6402)