RuntimeError: CUDA error: invalid device ordinal. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. #15

Open
lajihaonange opened this issue Jul 17, 2023 · 2 comments


lajihaonange commented Jul 17, 2023

I ran into this error when running the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --gpu_id 0123 --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir myfolder
Could someone help me solve it?

Owner

zsyOAOA commented Jul 17, 2023

I have updated the code. Please try the following command, which no longer passes --gpu_id:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir yourfolder

I suggest you first train the model on a single GPU, and then move on to distributed training.
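For anyone debugging a similar launch: "invalid device ordinal" generally means a process asked for a GPU index that is not among the devices visible to it. torchrun sets a LOCAL_RANK environment variable for each worker, and CUDA_VISIBLE_DEVICES restricts and renumbers the visible GPUs from 0. The sketch below is a hypothetical pre-flight check, not code from this repository; the function names are made up for illustration, and it only inspects environment variables, so it cannot tell whether the listed GPUs physically exist.

```python
import os


def visible_device_count(env=None):
    """Count the entries in CUDA_VISIBLE_DEVICES, or return None if it is unset.

    Assumption: every listed entry refers to a real GPU. This helper only
    parses the variable and does not query the driver.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: all GPUs on the machine are visible
    return len([entry for entry in raw.split(",") if entry.strip()])


def check_local_rank(env=None):
    """Return LOCAL_RANK if it is a valid index into the visible devices.

    Raises ValueError when the rank is out of range, which is the situation
    that would otherwise surface as "invalid device ordinal" at runtime.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    count = visible_device_count(env)
    if count is not None and local_rank >= count:
        raise ValueError(
            f"LOCAL_RANK={local_rank} but only {count} device(s) are visible"
        )
    return local_rank


if __name__ == "__main__":
    # Under torchrun, each worker would see its own LOCAL_RANK here.
    print("using device index", check_local_rank())
```

Calling check_local_rank() at the top of a training script, before any device is selected, turns a mismatch between the number of workers and the number of visible GPUs into a readable message instead of a CUDA runtime error.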

Author

lajihaonange commented Jul 17, 2023

Thank you for your timely reply. Training on a single GPU already works for me, and I will try your new code right now.
