Hello, I am having the following problem while reproducing the experiment:
```
Traceback (most recent call last):
  File "run_gpt_backdoor.py", line 1420, in <module>
  File "run_gpt_backdoor.py", line 611, in main
  File "run_gpt_backdoor.py", line 1032, in train_copier
    total_loss += loss.detach().float()
  File "/opt/conda/envs/em/lib/python3.8/site-packages/accelerate/accelerator.py", line 1310, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 156, in backward
    self.engine.backward(loss)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1929, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1951, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/envs/em/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/em/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected BFloat16
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /workspace/zzx/EmbMarker/src/wandb/offline-run-20231025_113839-wc0wm32b
wandb: Find logs at: ./wandb/offline-run-20231025_113839-wc0wm32b/logs
[2023-10-25 11:41:18,549] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1043906) of binary: /opt/conda/envs/em/bin/python
```
Could you tell me what might be causing this? Any help would be greatly appreciated. Thank you very much!
From the log information, it looks like you enabled DeepSpeed to run the experiments, which differs from the settings we provided. Did you follow the steps in the README (i.e., use the same Docker image and the same launching scripts)?
Meanwhile, the log is strange 🤔️: the `total_loss += loss.detach().float()` line should have nothing to do with the backward process. If you still have this issue after checking the environment setup and the launching script, you can try removing the `total_loss += loss.detach().float()` line, since it is only used for logging and is irrelevant to the main logic.
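For reference, here is a minimal sketch of a training step in which the logging accumulation is kept separate from the backward call. The names `accelerator`, `model`, `optimizer`, `lr_scheduler`, and `train_dataloader` are illustrative placeholders following the usual Hugging Face Accelerate loop, not the actual variables in `run_gpt_backdoor.py`:

```python
# Illustrative training step (assumed names, not the repo's actual code).
total_loss = 0.0
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss

    # accelerator.backward() dispatches to the DeepSpeed engine when DeepSpeed
    # is enabled (as the traceback above shows), so the loss is passed through
    # unchanged rather than cast by hand.
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()

    # The accumulation below only feeds logging. Because the loss is detached
    # first, it cannot affect the backward pass, which is why this line can
    # simply be removed if it keeps appearing in the traceback.
    total_loss += loss.detach().float()
```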