DeepSpeed MPI error #5288
Sabiha1225 started this conversation in General
Replies: 1 comment
Hi, I got a similar error. I am using Windows 10. How did you solve this?
ds_config = {
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme",
            "pin_memory": True,
            "ratio": 0.3,
            "buffer_count": 4,
            "fast_init": False
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "tensorboard": {
        "enabled": True,
        "output_path": "output/ds_logs_125/",
        "job_name": "train_bert"
    },
    "wandb": {
        "enabled": True,
        "group": "my_group",
        "team": "sabiha12",
        "project": "deepspeed"
    },
    "csv_monitor": {
        "enabled": True,
        "output_path": "output/ds_logs_125/",
        "job_name": "train_bert"
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
    "dump_state": True
}
This is my configuration. When I run an LLM with DeepSpeed, I get the following error:
[2024-03-16 12:09:15,782] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
['labels', 'input_ids', 'attention_mask']
[2024-03-16 12:09:17,968] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-16 12:09:17,968] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
Abort(1090191) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)........:
MPID_Init(1561)..............:
MPIDI_OFI_mpi_init_hook(1546):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090191
:
system msg for write_line failure : Bad file descriptor
Abort(1090191) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)........:
MPID_Init(1561)..............:
MPIDI_OFI_mpi_init_hook(1546):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090191
:
system msg for write_line failure : Bad file descriptor
Segmentation fault
Kindly suggest a solution.
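A possible workaround, offered as a sketch rather than a confirmed fix: the log line "Not using the DeepSpeed or dist launchers, attempting to detect MPI environment..." suggests DeepSpeed fell back to MPI discovery because the usual launcher environment variables were missing, and the crash then happened inside MPI initialization. Assuming a single-process, single-GPU run, setting those variables before DeepSpeed initializes may avoid the MPI path entirely. The helper name `set_single_process_env` and the default address and port are hypothetical choices for illustration.

```python
import os

# Variables DeepSpeed looks for before falling back to MPI discovery.
# Assumption: this list matches your DeepSpeed version; check comm.py if unsure.
REQUIRED_ENV = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def set_single_process_env(master_addr="127.0.0.1", master_port="29500"):
    """Fill in launcher variables for a one-process run (hypothetical helper).

    Values already present in the environment are left untouched.
    """
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in REQUIRED_ENV}


if __name__ == "__main__":
    # Call this before deepspeed.initialize(...) or deepspeed.init_distributed(...).
    print(set_single_process_env())
```

Starting the script with the `deepspeed` launcher instead of plain `python` should set the same variables automatically, so that may be the simpler route when it is available on your platform.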