issues Search Results · repo:deepspeedai/DeepSpeed language:Python
3k results
Describe the bug When fine-tuning Qwen2.5-14B with ZeRO2+offload on 4xA100 40G cards, I got a GPU OOM error.
To Reproduce Config file:
{
"train_batch_size": 8,
"bf16": { "enabled": true },
"zero_optimization" ...
bug
training
delock
- 3
- Opened 9 hours ago
- #7482
Is your feature request related to a problem? Please describe. DeepSpeedZeroOptimizer uses sp_process_group to partition
gradient parameters. Is it possible to use tp_parallel group instead? Otherwise ...
enhancement
Sirorezka
- Opened 2 days ago
- #7480
Description
Currently, DeepSpeed offers --bind_cores_to_rank and --bind_core_list flags to bind CPU cores, but these require
explicit specification from the user. While core binding works, it is not fully ...
enhancement
Antlera
- 6
- Opened 2 days ago
- #7478
Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is.
Ex. I'm always frustrated when [...]
Describe the solution you'd like A clear and ...
enhancement
Asil456
- Opened 3 days ago
- #7476
I followed the guide to perform checkpoint testing, and FastPersist performed very well.
https://github.com/deepspeedai/DeepSpeedExamples/tree/master/deepnvme/model_checkpoint I'd like to know if the
open-source ...
Buddingpopp
- Opened 3 days ago
- #7475
the code is based on
https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/ulysses-alst-sequence-parallelism.md but training
with a wiki.train.raw dataset
# train.py
from deepspeed.runtime.sequence_parallel.ulysses_sp ...
bug
training
kechunFIVE
- Opened 3 days ago
- #7473
Describe the bug A clear and concise description of what the bug is.
To Reproduce Steps to reproduce the behavior:
1. Go to ...
2. Click on ....
3. Scroll down to ....
4. See error
Expected behavior ...
bug
training
BrenchCC
- Opened 4 days ago
- #7471
/core/parallel_state.py , line 714, in initialize_model_parallel
[rank0]: group = torch.distributed.new_group(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File /mnt/e/conda-py311-cu124-torch27/lib/python3.11/site-packages/torch/distributed/c10d_logger.py ...
bug
training
janelu9
- Opened 4 days ago
- #7470
Hi, I'm having trouble installing deepspeed with additional flags. When I run
export NCCL_HOME=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/nccl
DS_BUILD_OPS=1 DS_BUILD_TRANSFORMER_INFERENCE=1 pip ...
Sirorezka
- Opened 5 days ago
- #7468
The Nightly CI for https://github.com/deepspeedai/DeepSpeed/actions/runs/16868139606 failed.
ci-failure
github-actions[bot]
- Opened 5 days ago
- #7467
