Problems during training #5

ZYCheng777 · 2021-04-08T10:58:40Z

Hello, I only have one GPU. When I try to train on the NYU dataset, I enter the following command:

CUDA_VISIBLE_DEVICES=0 python bts_main.py arguments_train_nyu.txt

and get the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ace/PycharmProjects/TransDepth-main/pytorch/bts_main.py", line 439, in main_worker
    var_sum = np.sum(var_sum)
  File "<__array_function__ internals>", line 6, in sum
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2248, in sum
    initial=initial, where=where)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/tensor.py", line 621, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I found some suggested solutions, but none of them worked. How can I solve this? I sincerely look forward to your reply.

ygjwd12345 · 2021-04-08T11:07:27Z

I guess the problem may be that you removed "--multiprocessing_distributed" from the arguments file. Can you run CUDA_VISIBLE_DEVICES=0 python bts_main.py arguments_train_nyu_debug.txt?

ZYCheng777 · 2021-04-08T12:49:44Z

Thank you for your reply. I ran

CUDA_VISIBLE_DEVICES=0 python bts_main.py arguments_train_nyu_debug.txt

The same problem occurred:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ace/PycharmProjects/TransDepth-main/pytorch/bts_main.py", line 439, in main_worker
    var_sum = np.sum(var_sum)
  File "<__array_function__ internals>", line 6, in sum
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2248, in sum
    initial=initial, where=where)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/tensor.py", line 621, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I also did not modify --multiprocessing_distributed.

ZYCheng777 · 2021-04-08T12:54:34Z

This is my arguments_train_nyu_debug.txt; I only modified the data paths.

--mode train
--model_name bts_nyu_v2_pytorch_att
--encoder resnet101_bts
--dataset nyu
--data_path /home/ace/data/nyu-dataset/nyu_depth_v2/sync/
--gt_path /home/ace/data/nyu-dataset/nyu_depth_v2/sync/
--filenames_file ../train_test_inputs/nyudepthv2_train_files_with_gt.txt
--batch_size 1
--num_epochs 50
--learning_rate 1e-4
--weight_decay 1e-2
--adam_eps 1e-3
--num_threads 1
--input_height 416
--input_width 512
--max_depth 10
--do_random_rotate
--degree 2.5
--log_directory ./models
--multiprocessing_distributed
--dist_url tcp://127.0.0.1:2346
--att_rank 3

--log_freq 10
--do_online_eval
--eval_freq 10
--data_path_eval /home/ace/data/nyu-dataset/nyu_depth_v2/official_splits/test/
--gt_path_eval /home/ace/data/nyu-dataset/nyu_depth_v2/official_splits/test/
--filenames_file_eval ../train_test_inputs/nyudepthv2_test_files_with_gt.txt
--min_depth_eval 1e-3
--max_depth_eval 10
--eval_summary_directory ./models/eval/
--eigen_crop

ygjwd12345 · 2021-04-08T13:33:03Z

Would you mind sharing your environment?

ZYCheng777 · 2021-04-09T01:44:28Z

Of course. My GPU is an RTX 3090 with CUDA 11.1, and my environment is:

python=3.7.10
torch=1.8.0
torchvision=0.9.0
numpy=1.20.2
tqdm=4.59.0
tensorboard=2.4.1
tensorboardX=2.1
ml-collections=0.1.0
medpy=0.4.0
SimpleITK=2.0.2
scipy=1.6.2
h5py=3.1.0
wandb=0.10.20

ygjwd12345 · 2021-04-09T07:28:54Z

I recommend torch==1.5.1 with CUDA 10.1 or torch==1.7.1 with CUDA 11.0. I have never tried torch>1.7.
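A quick way to see whether an installed torch falls inside the versions mentioned above is to compare version tuples. The following is only a minimal sketch: the helper names parse_version and check_torch_version and the three labels are hypothetical, and the only facts it encodes are the two recommended versions and the statement that nothing above 1.7 was tried.

```python
import re

# Versions recommended in this thread.
RECOMMENDED = {(1, 5, 1), (1, 7, 1)}
# Nothing newer than the 1.7 series was tried, per the comment above.
MAX_TESTED_MINOR = (1, 7)


def parse_version(text):
    """Turn '1.8.0' or '1.7.1+cu110' into a tuple of ints such as (1, 8, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def check_torch_version(text):
    """Return 'recommended', 'untested', or 'other' (hypothetical labels)."""
    version = parse_version(text)
    if version in RECOMMENDED:
        return "recommended"
    if version[:2] > MAX_TESTED_MINOR:
        return "untested"
    return "other"


if __name__ == "__main__":
    for candidate in ("1.5.1", "1.7.1+cu110", "1.8.0"):
        print(candidate, "->", check_torch_version(candidate))
```

With torch installed, the same check can be run as check_torch_version(torch.__version__).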

ZYCheng777 · 2021-04-09T09:09:23Z

After I installed PyTorch 1.7.1, the problem was solved. Thank you very much for your answers!

ygjwd12345 closed this as completed Apr 9, 2021
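
For readers who cannot downgrade, the error message itself points at another option: copy the tensors to host memory before handing them to NumPy at line 439 of bts_main.py. The sketch below is an illustration under assumptions, not the repository's code. It assumes var_sum is a list of scalar tensors living on the GPU (the thread does not show how it is built). The names host_sum and FakeDeviceScalar are hypothetical, and the stand-in class exists only so the example runs without torch or a GPU. The methods detach(), cpu() and item() are real torch.Tensor methods.

```python
def host_sum(values):
    """Sum scalars that may be plain numbers or tensor-like objects.

    Anything exposing .detach() (for example a 0-d torch.Tensor) is copied
    to host memory with .cpu() and unwrapped with .item() before adding,
    which is the step the TypeError asks for.
    """
    total = 0.0
    for value in values:
        if hasattr(value, "detach"):
            value = value.detach().cpu().item()
        total += float(value)
    return total


class FakeDeviceScalar:
    """Hypothetical stand-in for a 0-d CUDA tensor (no torch or GPU needed)."""

    def __init__(self, number):
        self._number = number

    def detach(self):
        return self

    def cpu(self):
        return self

    def item(self):
        return self._number


if __name__ == "__main__":
    print(host_sum([FakeDeviceScalar(1.5), FakeDeviceScalar(2.5), 4]))
```

Under the assumption above, the failing line could become var_sum = host_sum(var_sum). This has not been verified against the repository, so the downgrade to PyTorch 1.7.1 remains the fix confirmed in this thread.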
