issues Search Results · repo:deepspeedai/DeepSpeed language:Python

3k results

Describe the bug: When fine-tuning Qwen2.5-14B with ZeRO2+offload on 4xA100 40G cards, I got a GPU OOM error. To Reproduce: Config file: { "train_batch_size": 8, "bf16": { "enabled": true }, "zero_optimization" ...
bug
training
  • delock
  • 3
  • Opened 9 hours ago
  • #7482

Is your feature request related to a problem? Please describe. DeepSpeedZeroOptimizer uses sp_process_group to partition gradient parameters. Is it possible to use tp_parallel group instead? Otherwise ...
enhancement
  • Sirorezka
  • Opened 2 days ago
  • #7480

Description Currently, DeepSpeed offers --bind_cores_to_rank and --bind_core_list flags to bind CPU cores, but these require explicit specification from the user. While core binding works, it is not fully ...
enhancement
  • Antlera
  • 6
  • Opened 2 days ago
  • #7478

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] Describe the solution you'd like A clear and ...
enhancement
  • Asil456
  • Opened 3 days ago
  • #7476

I followed the guide to perform checkpoint testing, and FastPersist performed very well. https://github.com/deepspeedai/DeepSpeedExamples/tree/master/deepnvme/model_checkpoint I'd like to know if the open-source ...
  • Buddingpopp
  • Opened 3 days ago
  • #7475

The code is based on https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/ulysses-alst-sequence-parallelism.md but trains on a wiki.train.raw dataset. # train.py from deepspeed.runtime.sequence_parallel.ulysses_sp ...
bug
training
  • kechunFIVE
  • Opened 3 days ago
  • #7473

Describe the bug A clear and concise description of what the bug is. To Reproduce Steps to reproduce the behavior: 1. Go to ... 2. Click on .... 3. Scroll down to .... 4. See error Expected behavior ...
bug
training
  • BrenchCC
  • Opened 4 days ago
  • #7471

/core/parallel_state.py , line 714, in initialize_model_parallel [rank0]: group = torch.distributed.new_group( [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File /mnt/e/conda-py311-cu124-torch27/lib/python3.11/site-packages/torch/distributed/c10d_logger.py ...
bug
training
  • janelu9
  • Opened 4 days ago
  • #7470

Hi, I'm having trouble installing deepspeed with additional flags. When I run export NCCL_HOME=$CONDA_PREFIX/lib/python3.11/site-packages/nvidia/nccl DS_BUILD_OPS=1 DS_BUILD_TRANSFORMER_INFERENCE=1 pip ...
  • Sirorezka
  • Opened 5 days ago
  • #7468

The Nightly CI for https://github.com/deepspeedai/DeepSpeed/actions/runs/16868139606 failed.
ci-failure
  • github-actions[bot]
  • Opened 5 days ago
  • #7467