
Slow multi-node, multi-GPU pretraining #19

Closed
RealTong opened this issue Jun 14, 2023 · 1 comment
Labels: help wanted (Extra attention is needed)

Comments


RealTong commented Jun 14, 2023

Environment

Master node:
    GPU: 8 * 3090
    RAM: 500G
node1:
    GPU: 8 * 3090
    RAM: 500G

Inter-node bandwidth: 100MB

Launch script

deepspeed --num_gpus 8 --num_nodes 2 --hostfile=host.txt train.py \
    --gradient_accumulation_steps 3 \
    --model_name_or_path /root/.cache/CaMA \
    --model_max_length 1024 \
    --data_path /root/KnowLLM/pretrain/data/dataset \
    --output_dir /root/pretrain-model/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 1 \
    --learning_rate 1.5e-5 \
    --warmup_steps 300 \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/config.json \
    --fp16 True \
    --log_on_each_node False \
    --lr_scheduler_type "cosine" \
    --adam_beta1 0.9 --adam_beta2 0.95 --weight_decay 0.1
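
For reference, the host.txt passed via --hostfile uses DeepSpeed's standard hostfile format: one line per node with its GPU slot count. The hostnames below are placeholders, not taken from this issue:

    master slots=8
    node1 slots=8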

DeepSpeed config

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 0,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 1.5e-5,
          "betas": [
            0.9,
            0.95
          ],
          "eps": 1e-8,
          "weight_decay": 0.1
        }
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
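
For context, assuming train.py uses the HuggingFace Trainer (as the arguments suggest), the "auto" entries are filled in from the launch flags: train_micro_batch_size_per_gpu comes from --per_device_train_batch_size, and train_batch_size is derived from it together with the gradient accumulation steps and the world size. A quick check of the resulting global batch for the settings above, assuming all 16 GPUs participate:

    # global batch = per_device_train_batch_size * gradient_accumulation_steps * total GPUs
    echo $((16 * 3 * 16))   # 768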

Problem

With the same dataset, pretraining on a single machine with multiple GPUs (8 * 3090) takes about 1 hour, while multi-node training (one master and one worker node) takes more than 12 hours.

MikeDean2367 (Collaborator) commented Jun 14, 2023

Hi, we have run into the same problem. When we scaled out to 3 nodes, training was not even as fast as on a single machine. Our nodes also communicate over the network, with a bandwidth of roughly one gigabit, so we suspect the bottleneck is inter-node communication. You could try the following:

  1. Increase the network bandwidth between nodes, or use a direct physical interconnect;
  2. Try adjusting the batch size or the gradient accumulation steps (a rough sketch follows below).

Hope this helps.
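
As a rough illustration of point 2 only, the two knobs are flags already present in the launch command. The values below are hypothetical; whether they fit in 3090 memory and actually reduce the communication overhead under ZeRO-3 has to be benchmarked against the single-node run:

    # purely illustrative: keeps the global batch at 8 * 6 * 16 = 768
    # while changing how work is split between micro-batches and accumulation
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 6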

zxlzr added the help wanted label on Jun 16, 2023
zxlzr closed this as completed on Jun 16, 2023