Environment

Master node: GPU: 8 × 3090, RAM: 500 GB
node1: GPU: 8 × 3090, RAM: 500 GB
Bandwidth between nodes: 100 MB
Launch script

```bash
deepspeed --num_gpus 8 --num_nodes 2 --hostfile=host.txt train.py \
    --gradient_accumulation_steps 3 \
    --model_name_or_path /root/.cache/CaMA \
    --model_max_length 1024 \
    --data_path /root/KnowLLM/pretrain/data/dataset \
    --output_dir /root/pretrain-model/ \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 1 \
    --learning_rate 1.5e-5 \
    --warmup_steps 300 \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/config.json \
    --fp16 True \
    --log_on_each_node False \
    --lr_scheduler_type "cosine" \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --weight_decay 0.1
```
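For reference, the `--hostfile=host.txt` passed above is expected in DeepSpeed's standard hostfile format, one line per node with a slot count; the hostnames below are placeholders, not taken from the issue:

```
# host.txt (hostnames are illustrative)
master slots=8
node1 slots=8
```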
deepspeed config

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1.5e-5,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
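As a side note, the `"auto"` batch-size values in this config are resolved from the launch flags and the world size. A quick sketch of the arithmetic (the resolution rule is the standard micro-batch × accumulation × world-size product; the numbers come from the flags above):

```python
# Effective "train_batch_size" for this 2-node run, resolved from the launch flags.
micro_batch_per_gpu = 16   # --per_device_train_batch_size
grad_accum_steps = 3       # --gradient_accumulation_steps
world_size = 2 * 8         # 2 nodes x 8 GPUs

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)  # samples consumed per optimizer step
```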
Problem

With the same dataset, pretraining on a single machine with multiple GPUs (8 × 3090) takes about 1 hour, while pretraining on multiple machines (one master, one node) takes more than 12 hours.
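A slowdown of this magnitude is plausible from arithmetic alone: under ZeRO stage 3 the parameters are partitioned across ranks and gathered over the network during every forward/backward pass. A back-of-envelope sketch (both the 13B parameter count and reading the "100 MB" figure as ~100 MB/s of usable bandwidth are assumptions, not facts from the issue):

```python
# Rough cost of moving one fp16 copy of the parameters across the link.
params = 13e9                # assumed parameter count (not stated in the issue)
bytes_per_param = 2          # fp16
link_bytes_per_s = 100e6     # "100 MB" interpreted as ~100 MB/s

seconds_per_traversal = params * bytes_per_param / link_bytes_per_s
print(f"{seconds_per_traversal:.0f} s to traverse the parameters once")
```

Several such traversals per step would dominate the step time on a link this slow, which is consistent with the 12× slowdown reported.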
Hello, we have run into the same problem. When we scaled out to 3 nodes, training was not even as fast as on a single machine. Our nodes also communicate over the network, with roughly gigabit bandwidth, so we suspect the bottleneck is inter-node communication. You could try the following:
Hope this helps.