
Slow multi-node, multi-GPU pretraining #19

Closed
RealTong opened this issue Jun 14, 2023 · 1 comment
Labels: help wanted (Extra attention is needed)

Comments


RealTong commented Jun 14, 2023

Environment

Master node:
    GPU: 8 * 3090
    RAM: 500G
node1:
    GPU: 8 * 3090
    RAM: 500G

Inter-node bandwidth: 100MB

Launch script

deepspeed --num_gpus 8 --num_nodes 2 --hostfile=host.txt train.py \
    --gradient_accumulation_steps 3 \
    --model_name_or_path /root/.cache/CaMA \
    --model_max_length 1024 \
    --data_path /root/KnowLLM/pretrain/data/dataset \
    --output_dir /root/pretrain-model/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 1 \
    --learning_rate 1.5e-5 \
    --warmup_steps 300 \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/config.json \
    --fp16 True \
    --log_on_each_node False \
    --lr_scheduler_type "cosine" \
    --adam_beta1 0.9 --adam_beta2 0.95 --weight_decay 0.1
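
For reference, the host.txt passed via --hostfile uses DeepSpeed's standard hostfile format: one line per node with its GPU slot count. The hostnames below are placeholders, not taken from this issue:

    master slots=8
    node1 slots=8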

DeepSpeed config

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 0,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 1.5e-5,
          "betas": [
            0.9,
            0.95
          ],
          "eps": 1e-8,
          "weight_decay": 0.1
        }
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
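
For context, assuming train.py uses the HuggingFace Trainer (as the arguments suggest), the "auto" entries are filled in from the launch flags: train_micro_batch_size_per_gpu comes from --per_device_train_batch_size, and train_batch_size is derived from it together with the gradient accumulation steps and the world size. A quick check of the resulting global batch for the settings above, assuming all 16 GPUs participate:

    # global batch = per_device_train_batch_size * gradient_accumulation_steps * total GPUs
    echo $((16 * 3 * 16))   # 768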

Problem

With the same dataset, pretraining on a single machine with multiple GPUs (8 * 3090) takes about 1 hour, while multi-node training (one master and one worker node) takes more than 12 hours.

MikeDean2367 (Collaborator) commented Jun 14, 2023

Hi, we have run into the same problem. When we scaled out to 3 nodes, training was not even as fast as on a single machine. Our nodes also communicate over the network, with a bandwidth of roughly one gigabit, so we suspect the bottleneck is inter-node communication. You could try the following:

  1. Increase the network bandwidth between nodes, or use a direct physical interconnect;
  2. Try adjusting the batch size or the gradient accumulation steps (a rough sketch follows below).

Hope this helps.
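
As a rough illustration of point 2 only, the two knobs are flags already present in the launch command. The values below are hypothetical; whether they fit in 3090 memory and actually reduce the communication overhead under ZeRO-3 has to be benchmarked against the single-node run:

    # purely illustrative: keeps the global batch at 8 * 6 * 16 = 768
    # while changing how work is split between micro-batches and accumulation
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 6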

zxlzr added the help wanted label on Jun 16, 2023
zxlzr closed this as completed on Jun 16, 2023