Hello, I am having the following problem while reproducing the experiment:
```
Traceback (most recent call last):
  File "run_gpt_backdoor.py", line 1420, in <module>
  File "run_gpt_backdoor.py", line 611, in main
  File "run_gpt_backdoor.py", line 1032, in train_copier
    total_loss += loss.detach().float()
  File "/opt/conda/envs/em/lib/python3.8/site-packages/accelerate/accelerator.py", line 1310, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 156, in backward
    self.engine.backward(loss)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1929, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1951, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/em/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected BFloat16
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /workspace/zzx/EmbMarker/src/wandb/offline-run-20231025_113839-wc0wm32b
wandb: Find logs at: ./wandb/offline-run-20231025_113839-wc0wm32b/logs
[2023-10-25 11:41:18,549] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1043906) of binary: /opt/conda/envs/em/bin/python
```
Could you tell me what might be causing this? Any help would be greatly appreciated. Thank you very much!
From the log information, it looks like you enabled DeepSpeed to run the experiments, which differs from the settings we provided. Did you follow the steps in the README (i.e., use the same Docker image and the same launching scripts)?
Meanwhile, the log is strange 🤔️: the `total_loss += loss.detach().float()` line should have nothing to do with the backward process. If you still have this issue after checking the environment setup and the launching script, you can try removing the `total_loss += loss.detach().float()` line, since it is only used for logging and is irrelevant to the main logic.
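For reference, here is a minimal sketch of a training step in which the logging accumulation is kept separate from the backward call. The names `accelerator`, `model`, `optimizer`, `lr_scheduler`, and `train_dataloader` are illustrative placeholders following the usual Hugging Face Accelerate loop, not the actual variables in `run_gpt_backdoor.py`:

```python
# Illustrative training step (assumed names, not the repo's actual code).
total_loss = 0.0
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss

    # accelerator.backward() dispatches to the DeepSpeed engine when DeepSpeed
    # is enabled (as the traceback above shows), so the loss is passed through
    # unchanged rather than cast by hand.
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()

    # The accumulation below only feeds logging. Because the loss is detached
    # first, it cannot affect the backward pass, which is why this line can
    # simply be removed if it keeps appearing in the traceback.
    total_loss += loss.detach().float()
```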