RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. #15

lajihaonange · 2023-07-17T01:23:56Z

I ran into this error when running the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --gpu_id 0123 --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir myfolder
Could someone help me solve it?
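For readers hitting the same message: "invalid device ordinal" generally means a process asked for a CUDA device index that is not among the devices visible to it. The sketch below is not part of this repository; `visible_device_slots` and `ordinal_may_be_valid` are hypothetical helper names. It only inspects the `CUDA_VISIBLE_DEVICES` environment variable, so it gives an upper bound on the visible device count and cannot confirm that the listed GPUs physically exist.

```python
import os


def visible_device_slots(env=None):
    """Return the number of entries in CUDA_VISIBLE_DEVICES.

    This is only an upper bound on the devices a process can see.
    Returns None when the variable is unset, in which case every
    device on the machine is visible and the count must be queried
    from the driver instead.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return len([entry for entry in raw.split(",") if entry.strip()])


def ordinal_may_be_valid(ordinal, env=None):
    """Check a device index against the CUDA_VISIBLE_DEVICES mask.

    Visible devices are renumbered from 0, so with four entries the
    valid indices are 0 through 3.
    """
    slots = visible_device_slots(env)
    if slots is None:
        # Mask unset: only negative indices can be ruled out here.
        return ordinal >= 0
    return 0 <= ordinal < slots


if __name__ == "__main__":
    mask = {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}
    for idx in (0, 3, 4):
        print(idx, ordinal_may_be_valid(idx, mask))
```

If a machine has fewer than four GPUs, indices that pass this check can still be rejected by CUDA, so comparing the mask against the output of `nvidia-smi` is a reasonable next step.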


zsyOAOA · 2023-07-17T06:05:13Z

I have updated the code. Please try:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir yourfolder

I suggest first training the model on a single GPU, and then moving on to distributed training.
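As background on why the updated command no longer needs `--gpu_id`: `torchrun` sets a `LOCAL_RANK` environment variable for each worker it launches, and a script can derive its device index from that. The following is a minimal sketch under that assumption, not the actual code in `main_diffusion.py`; `local_device_index` is a hypothetical helper, and the device count is passed in as a plain argument so the example runs without PyTorch installed.

```python
import os


def local_device_index(device_count, env=None):
    """Pick the CUDA device index for this worker from LOCAL_RANK.

    device_count is the number of devices visible to the process
    (in a real script, typically torch.cuda.device_count()).
    LOCAL_RANK defaults to 0 so the same code works for a plain
    single-GPU run that is not launched through torchrun.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if not 0 <= local_rank < device_count:
        # Failing early here gives a clearer message than the
        # asynchronous "invalid device ordinal" error from CUDA.
        raise ValueError(
            f"LOCAL_RANK={local_rank} but only {device_count} "
            "device(s) are visible"
        )
    return local_rank


if __name__ == "__main__":
    # Simulate worker 2 of 4 as torchrun would launch it.
    print(local_device_index(4, {"LOCAL_RANK": "2"}))
```

In a training script the returned index would typically be passed to `torch.cuda.set_device` before any tensors are created. A single-GPU run takes the default branch (`LOCAL_RANK` unset, index 0), which is one reason it is a useful first check before launching four workers.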

lajihaonange · 2023-07-17T11:36:29Z

Thank you for your timely reply. Training on a single GPU worked. I will try your new code right now.
