Tags · deepspeedai/DeepSpeed

v0.16.4

Fix, bf16 optimizer remove dup loop (#7054)

bf16 with moe refresh optimizer state from bf16 ckpt will raise
IndexError: list index out of range

Signed-off-by: shaomin <wukon1992@gmail.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>

Feb 20, 2025
e2dc3ee
zip
tar.gz
Notes

v0.16.3

Using explicit GPU upcast for ZeRO-Offload (#6962)

Following discussion in
[PR-6670](#6670), the explict
upcast is much more efficient than implicit upcast, this PR is to
replace implicit upcast with explict one.

The results on 3B model are shown below:

| Option | BWD (ms) | Speed up |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8X |
| After this PR| 309.2 | 82.8X |

Jan 21, 2025
c17dc33
zip
tar.gz
Notes

v0.16.2

Update code owners (#6890)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Dec 18, 2024
b344c04
zip
tar.gz
Notes

v0.16.1

Pin transformers version in cpu-torch-latest due to multiprocessing e…

…rror. (#6823)

This is a copy of #6820 for
the cpu-torch-latest tests.

This PR will revert/fix these:
#6822

Dec 5, 2024
95ead2a
zip
tar.gz
Notes

v0.16.0

Revert release workflow (#6785)

Nov 25, 2024
e5570b1
zip
tar.gz
Notes

v0.15.4

Switch what versions of python are supported (#5676)

Add support for testing compilation with python 3.11/3.12.  

Also add the dockerfiles used to build those images.

---------

Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>

Nov 7, 2024
a1b0c35
zip
tar.gz
Notes

v0.15.3

[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645)

We have found that #6592 uses `_pinned_tensor_mgr` to create cpu bounce
buffer, which is same with what our xpu accelerator currently doing.
So no need to use xpu device specific cpu_op_desc_t.
In this PR:
1. remove custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify xpu async_io opbuilder.

This issue cannot be easily done with revert #6532 , for we added some
source file as last time GDS feature going in DS. So file this new PR :)

Oct 22, 2024
a24cdd6
zip
tar.gz
Notes

v0.15.2

Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)

Modified _replace_module in auto_tp.py :
The modification keeps the layers 'shared_expert_gate' and 'gate' in
qwen2-moe the original type torch.nn.Linear and not changes them into
LinearLayer. In this way, their weights will not be split into multiple
HPU/GPU cards. Then the qwen2-moe can run on multiple HPU/GPU cards.
Since the weights of 'gate' are not split into multiple HPU/GPU cards,
all gather operations are not needed, which may improve performance.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

Oct 9, 2024
474a328
zip
tar.gz
Notes

v0.15.1

Handle an edge case where `CUDA_HOME` is not defined on ROCm systems (#…

…6488)

* Handles an edge case when building `gds` where `CUDA_HOME` is not
defined on ROCm systems

Sep 4, 2024
10ba3dd
zip
tar.gz
Notes

v0.15.0

Fix torch check (#6402)

Aug 22, 2024
55b4cae
zip
tar.gz
Notes

PreviousNext

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.16.4

v0.16.3

v0.16.2

v0.16.1

v0.16.0

v0.15.4

v0.15.3

v0.15.2

v0.15.1

v0.15.0

Tags: deepspeedai/DeepSpeed