RuntimeError: Found dtype Float but expected BFloat16 #2

Open
zzx528 opened this issue Oct 25, 2023 · 2 comments

zzx528 commented Oct 25, 2023

Hello, I am having the following problem while reproducing the experiment:
Traceback (most recent call last):
File "run_gpt_backdoor.py", line 1420, in
File "run_gpt_backdoor.py", line 611, in main
File "run_gpt_backdoor.py", line 1032, in train_copier
total_loss += loss.detach().float()
File "/opt/conda/envs/em/lib/python3.8/site-packages/accelerate/accelerator.py", line 1310, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/opt/conda/envs/em/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 156, in backward
self.engine.backward(loss)
File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1929, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1951, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/opt/conda/envs/em/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/opt/conda/envs/em/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/opt/conda/envs/em/lib/python3.8/site-packages/torch/autograd/init.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected BFloat16
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /workspace/zzx/EmbMarker/src/wandb/offline-run-20231025_113839-wc0wm32b
wandb: Find logs at: ./wandb/offline-run-20231025_113839-wc0wm32b/logs
[2023-10-25 11:41:18,549] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1043906) of binary: /opt/conda/envs/em/bin/python

May I ask why this happens? Could you help us out? Thank you very much!

yjw1029 (Owner) commented Oct 26, 2023

From the log information, it looks like you enabled DeepSpeed to run the experiments, which differs from the settings we provided. Did you follow the steps in the README (i.e., use the same Docker image and the same launching scripts)?
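
For reference, a quick way to confirm which backend and precision Accelerate actually resolved at launch time is a sanity print like the minimal sketch below (illustrative only, not code from this repo; it assumes the script constructs an Accelerator as usual):

from accelerate import Accelerator
from accelerate.utils import DistributedType

accelerator = Accelerator()
# DistributedType.DEEPSPEED here means a DeepSpeed plugin/config is active
print(accelerator.distributed_type)
# "bf16", "fp16", or "no"
print(accelerator.mixed_precision)

If this prints DistributedType.DEEPSPEED while the provided launching scripts do not use DeepSpeed, the mismatch likely comes from a local accelerate/DeepSpeed config.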

yjw1029 (Owner) commented Oct 26, 2023

Meanwhile, the log information is strange 🤔️. The line total_loss += loss.detach().float() should have nothing to do with the backward process. If you still have this issue after checking the environment setup and the launching script, you can try removing the total_loss += loss.detach().float() line, since it is only used for logging and is irrelevant to the main logic.
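
To illustrate why that line should not affect backpropagation, the usual pattern looks roughly like the sketch below (variable names and loop structure are illustrative, not the exact code in run_gpt_backdoor.py); the accumulated value is detached from the graph, so only the original loss participates in backward:

total_loss = 0.0
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    # detached fp32 copy, used only for logging/averaging
    total_loss += loss.detach().float()
    # backward runs on the original (still attached) loss, not the detached copy
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()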
