Problems during training #5

Closed
ZYCheng777 opened this issue Apr 8, 2021 · 7 comments

Comments

@ZYCheng777

Hello, I only have one GPU. When I try to train on the NYU dataset, I run the following command:

CUDA_VISIBLE_DEVICES=0 python bts_main.py arguments_train_nyu.txt

and get the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ace/PycharmProjects/TransDepth-main/pytorch/bts_main.py", line 439, in main_worker
    var_sum = np.sum(var_sum)
  File "<__array_function__ internals>", line 6, in sum
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2248, in sum
    initial=initial, where=where)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/tensor.py", line 621, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I found some suggested solutions, but none of them worked. How can I solve this? I sincerely look forward to your reply.
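
For anyone reading this later with the same traceback: the failing line is `np.sum(var_sum)`, and the error says NumPy was handed a tensor that still lives on `cuda:0`. One possible workaround is to copy each element to host memory before summing. The sketch below assumes `var_sum` is a Python list of single-element tensors; the thread does not show how it is built, so that is a guess. `to_host_scalar` and `host_sum` are hypothetical helper names, not part of the repository.

```python
def to_host_scalar(value):
    """Return a plain Python float for a tensor-like or numeric scalar."""
    if hasattr(value, "detach"):
        # Tensor-like (e.g. torch.Tensor): drop autograd history, copy to
        # host memory, then extract the single element as a Python number.
        return float(value.detach().cpu().item())
    return float(value)


def host_sum(values):
    """Sum scalars that may live on a GPU, doing the addition on the host."""
    return sum(to_host_scalar(v) for v in values)


# Hypothetical use in place of `var_sum = np.sum(var_sum)`:
#     var_sum = host_sum(var_sum)
```

This only addresses the one line shown in the traceback; other parts of the script may have their own version-specific issues.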

@ygjwd12345
Owner

My guess is that you removed "--multiprocessing_distributed" from the arguments file. Can you run CUDA_VISIBLE_DEVICES=0 python bts_main.py arguments_train_nyu_debug.txt?

@ZYCheng777
Author

Thank you for your reply. I ran

CUDA_VISIBLE_DEVICES=0 python bts_main.py arguments_train_nyu_debug.txt

and the same problem occurred:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ace/PycharmProjects/TransDepth-main/pytorch/bts_main.py", line 439, in main_worker
    var_sum = np.sum(var_sum)
  File "<__array_function__ internals>", line 6, in sum
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2248, in sum
    initial=initial, where=where)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  File "/home/ace/Anaconda3/envs/TransDepth/lib/python3.7/site-packages/torch/tensor.py", line 621, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I did not modify --multiprocessing_distributed.

@ZYCheng777
Author

This is my arguments_train_nyu_debug.txt. I only modified the data paths.

--mode train
--model_name bts_nyu_v2_pytorch_att
--encoder resnet101_bts
--dataset nyu
--data_path /home/ace/data/nyu-dataset/nyu_depth_v2/sync/
--gt_path /home/ace/data/nyu-dataset/nyu_depth_v2/sync/
--filenames_file ../train_test_inputs/nyudepthv2_train_files_with_gt.txt
--batch_size 1
--num_epochs 50
--learning_rate 1e-4
--weight_decay 1e-2
--adam_eps 1e-3
--num_threads 1
--input_height 416
--input_width 512
--max_depth 10
--do_random_rotate
--degree 2.5
--log_directory ./models
--multiprocessing_distributed
--dist_url tcp://127.0.0.1:2346
--att_rank 3

--log_freq 10
--do_online_eval
--eval_freq 10
--data_path_eval /home/ace/data/nyu-dataset/nyu_depth_v2/official_splits/test/
--gt_path_eval /home/ace/data/nyu-dataset/nyu_depth_v2/official_splits/test/
--filenames_file_eval ../train_test_inputs/nyudepthv2_test_files_with_gt.txt
--min_depth_eval 1e-3
--max_depth_eval 10
--eval_summary_directory ./models/eval/
--eigen_crop
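
Since the question of whether `--multiprocessing_distributed` was still set came up, here is a small sketch for checking an arguments file programmatically. It assumes one option per line, as in the listing above. `parse_args_text` is a hypothetical helper written for illustration, not how `bts_main.py` itself reads the file.

```python
def parse_args_text(text):
    """Parse lines of the form `--name value` or `--flag` into a dict."""
    options = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("--"):
            continue  # skip blank lines and anything that is not an option
        name = parts[0][2:]
        # An option with no value is treated as a boolean flag.
        options[name] = " ".join(parts[1:]) if len(parts) > 1 else True
    return options


# A shortened sample in the same layout as the file above.
sample = """--mode train
--batch_size 1
--multiprocessing_distributed
--dist_url tcp://127.0.0.1:2346

--do_online_eval
"""

opts = parse_args_text(sample)
print("multiprocessing_distributed set:", opts.get("multiprocessing_distributed", False))
print("batch_size:", opts["batch_size"])
```

Values are kept as strings here; converting them to numbers is left out to keep the sketch short.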

@ygjwd12345
Owner

Would you mind sharing your environment?

@ZYCheng777
Author

Of course. My GPU is an RTX 3090 with CUDA 11.1, and my environment is:

python=3.7.10
torch=1.8.0
torchvision=0.9.0
numpy=1.20.2
tqdm=4.59.0
tensorboard=2.4.1
tensorboardX=2.1
ml-collections=0.1.0
medpy=0.4.0
SimpleITK=2.0.2
scipy=1.6.2
h5py=3.1.0
wandb=0.10.20

@ygjwd12345
Owner

I recommend torch==1.5.1 with CUDA 10.1, or pytorch==1.7.1 with CUDA 11.0. I have never tried torch>1.7.
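
As a rough illustration of that advice, a training script could warn when it runs on a PyTorch release newer than the ones tried so far. The sketch below is an assumption-laden example: the `(1, 7)` threshold is taken from the comment above, and `version_tuple` and `is_untested_torch` are hypothetical helpers that do not exist in the repository.

```python
import re


def version_tuple(version):
    """Turn a string like '1.8.0+cu111' into (1, 8, 0)."""
    release = version.split("+")[0]  # drop a local build suffix such as +cu111
    numbers = []
    for piece in release.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def is_untested_torch(version, newest_tested=(1, 7)):
    """True if `version` is newer than the newest major.minor release tried."""
    return version_tuple(version)[:2] > newest_tested


# Hypothetical use at the top of a training script:
#     import torch
#     if is_untested_torch(torch.__version__):
#         print("Warning: this torch version has not been tried with this code.")
```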

@ZYCheng777
Author

After I installed PyTorch 1.7.1, the problem was solved. Thank you very much for your answers!
