Description
Using `accelerate launch` to run the training script. When the model is loaded with `device_map` set to `"auto"`, an error occurs before training starts because the Trainer tries to prepare the pipeline-parallel (PP) model as DDP. A sketch of the setup and the full traceback follow.
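A minimal sketch of the kind of script that hits this (the checkpoint name, output dir, and the tiny in-memory dataset are placeholders, not the exact `finetune.py` from this report):

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/opt-1.3b"  # placeholder; the report does not name the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" spreads the layers over the visible GPUs, so the model
# is already placed across devices before the Trainer ever sees it.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Tiny in-memory dataset, just enough to reach trainer.train().
texts = ["hello world"] * 8
train_ds = Dataset.from_dict(dict(tokenizer(texts, truncation=True, max_length=32)))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Launched with: accelerate launch finetune.py  (multi-process, e.g. 2 GPUs)
# trainer.train() -> accelerator.prepare() wraps the device-mapped model in
# DistributedDataParallel, and the run dies with the traceback below.
trainer.train()
```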
```
[llm toolkit]: *** Train ***
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/sdb/user_1/workspace/llm-toolkit/tmp/test_prepare/finetune.py", line 54, in <module>
[rank0]:     train(
[rank0]:   File "/mnt/sdb/user_1/workspace/llm-toolkit/llmtoolkit/train.py", line 144, in train
[rank0]:     train_result = trainer.train()
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/transformers/trainer.py", line 2374, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/accelerate/accelerator.py", line 1446, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/accelerate/accelerator.py", line 1447, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/accelerate/accelerator.py", line 1595, in prepare_model
[rank0]:     model = torch.nn.parallel.DistributedDataParallel(
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 827, in __init__
[rank0]:     _sync_module_states(
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/distributed/utils.py", line 317, in _sync_module_states
[rank0]:     _sync_params_and_buffers(process_group, module_states, broadcast_bucket_size, src)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/distributed/utils.py", line 328, in _sync_params_and_buffers
[rank0]:     dist._broadcast_coalesced(
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/distributed/tensor/_api.py", line 340, in __torch_dispatch__
[rank0]:     return DTensor._op_dispatcher.dispatch(
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 166, in dispatch
[rank0]:     op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 371, in unwrap_to_op_info
[rank0]:     self._try_replicate_spec_for_scalar_tensor(op_call, arg, mesh)
[rank0]:   File "/mnt/sdb/user_1/anaconda3/envs/workspace/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 470, in _try_replicate_spec_for_scalar_tensor
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: aten.cat.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
[rank0]:[W527 23:23:59.836288935 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0527 23:24:01.485000 3226366 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3226692 closing signal SIGTERM
E0527 23:24:01.752000 3226366 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 3226691) of binary: /mnt/sdb/user_1/anaconda3/envs/workspace/bin/python
```
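For reference, a sketch of a guard a launcher script could add before handing the model to the Trainer. The function name `is_pipeline_dispatched` and the exact condition are my own assumptions, not part of llm-toolkit or transformers; it only relies on the `hf_device_map` attribute that `from_pretrained(device_map=...)` sets, and it assumes `model` from the snippet above.

```python
import torch.distributed as dist

def is_pipeline_dispatched(model) -> bool:
    """True if from_pretrained(device_map=...) placed modules on 2+ devices."""
    device_map = getattr(model, "hf_device_map", None)
    return bool(device_map) and len(set(device_map.values())) > 1

# Before calling trainer.train() under a multi-process accelerate launch:
if (
    is_pipeline_dispatched(model)
    and dist.is_available()
    and dist.is_initialized()
    and dist.get_world_size() > 1
):
    raise RuntimeError(
        "Model is split across devices by device_map='auto'; run it in a "
        "single process instead of letting the Trainer wrap it in DDP."
    )
```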