This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

AssertionError: Range subprocess failed (exit code: 1) #63

Closed
tunglm2203 opened this issue May 20, 2018 · 19 comments

Comments

@tunglm2203

Hi @roytseng-tw
When I evaluate the training result, I hit the problem below:

INFO subprocess.py: 129: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 131: stdout of subprocess 0 with range [1, 1250]
INFO subprocess.py: 133: # ---------------------------------------------------------------------------- #
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 4, in
import cv2
ImportError: No module named cv2
Traceback (most recent call last):
File "tools/test_net.py", line 119, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 155, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 187, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 109, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 147, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

I have installed OpenCV and can import cv2 successfully, but I don't know what causes this problem. I tried the solution in https://github.com/facebookresearch/Detectron/issues/349, but it did not help. In the config file e2e_mask_rcnn_R-50-C4_1x.yaml, I only changed NUM_GPUS and kept everything else as in the original. Can you tell me what the problem is?

The command that I ran:
python3 tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth --multi-gpu-testing --output_dir Output_val

System information

  • Operating system: Ubuntu 16.04.4 LTS
  • CUDA version: 8
  • cuDNN version: 5.1
  • GPU models (for all devices if they are not all the same): TITAN X (4 GPUS)
  • python version: 3.5
  • pytorch version: 0.3.1
@roytseng-tw
Owner

https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/subprocess.py#L71
Changing python to python3 there may solve your problem.
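
A related option, sketched here as an assumption and not as the repository's actual code, is to build the subprocess command from sys.executable, so that the child runs under the same interpreter (and the same installed packages, such as cv2) as the parent. The helper names build_range_command and child_interpreter are hypothetical; only the --range flag comes from the logs in this thread.

```python
import subprocess
import sys


def build_range_command(script, start, end):
    """Return an argv list that runs `script` with the parent's interpreter."""
    # sys.executable is the path of the Python binary running this code, so the
    # child does not depend on whichever `python` happens to be first on PATH.
    return [sys.executable, script, "--range", str(start), str(end)]


def child_interpreter():
    """Ask a child process which interpreter it is running under."""
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.executable)"]
    )
    return out.decode().strip()


if __name__ == "__main__":
    print(build_range_command("tools/test_net.py", 0, 1250))
    print(child_interpreter())
```

Passing an argv list instead of a single shell string also avoids quoting problems with arguments such as the TEST.DATASETS tuple.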

@tunglm2203
Author

Thanks @roytseng-tw for the quick reply. I modified the line at the link you suggested, and the ImportError: No module named cv2 is fixed. But the subprocess problem still exists.

DEBUG: Run into test_net_data_set()
INFO subprocess.py: 88: detection range command 0: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 0 1250 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 88: detection range command 1: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 1250 2500 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 88: detection range command 2: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 2500 3750 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 88: detection range command 3: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 3750 5000 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 128: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 130: stdout of subprocess 0 with range [1, 1250]
INFO subprocess.py: 132: # ---------------------------------------------------------------------------- #
INFO test_net.py: 73: Called with args:
INFO test_net.py: 74: Namespace(cfg_file='Output_val/detection_range_config.yaml', dataset=None, load_ckpt='Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=False, output_dir='Output_val', range=[0, 1250], set_cfgs=['TEST.DATASETS', '("coco_2017_val",)'], vis=False)
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 76, in
assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing)
AssertionError
Traceback (most recent call last):
File "tools/test_net.py", line 117, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 155, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 187, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 108, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 146, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)
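
The inner assertion in the traceback above is an exclusive-or: it holds only when exactly one of "one GPU is visible" and "--multi-gpu-testing was passed" is true. A minimal sketch of that truth table follows; the helper name gpu_check_passes is hypothetical and stands in for the asserted expression in test_net.py.

```python
def gpu_check_passes(device_count, multi_gpu_testing):
    """Mirror the XOR from the traceback: exactly one side must be true."""
    return (device_count == 1) ^ bool(multi_gpu_testing)


if __name__ == "__main__":
    # Print the result for each combination of GPU count and flag.
    for count in (1, 4):
        for flag in (False, True):
            print(count, flag, gpu_check_passes(count, flag))
```

Since the subprocess in the log was called with multi_gpu_testing=False, the failed assertion implies that it saw a GPU count other than 1.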

@tunglm2203
Author

tunglm2203 commented May 20, 2018

I modified the command in https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/subprocess.py#L71 by adding --multi-gpu-testing, but now there is another problem:
INFO test_engine.py: 330: loading checkpoint Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 118, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 125, in result_getter
gpu_id=gpu_id
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 253, in test_net
cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test.py", line 70, in im_detect_all
model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test.py", line 139, in im_detect_bbox
return_dict = model(**inputs)
File "/mnt/hdd/tung/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/nn/parallel/data_parallel.py", line 82, in forward
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/nn/parallel/data_parallel.py", line 82, in
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range
Traceback (most recent call last):
File "tools/test_net.py", line 118, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 154, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 186, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 107, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 145, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

@roytseng-tw
Owner

You should check out the inference section in the README.
Specify --multi-gpu-testing if multiple gpus are available.

@tunglm2203
Author

Thanks @roytseng-tw. Actually, I did pass --multi-gpu-testing in my command:
python3 tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth --multi-gpu-testing --output_dir Output_val

But in https://github.com/roytseng-tw/Detectron.pytorch/blob/1833c71a62e389d2b5f873f40a914c5a47bdd8a2/lib/utils/subprocess.py#L71, --multi-gpu-testing is not passed to the subprocess. I changed the command passed to the subprocess to include it, but then I get the other problem shown above.

@roytseng-tw
Owner

You should not change anything except python --> python3.

@tunglm2203
Author

tunglm2203 commented May 21, 2018

Update: @roytseng-tw yes, I kept everything as you said. Evaluating on a single GPU runs successfully, but when I pass --multi-gpu-testing in my command and select the GPU devices through CUDA_VISIBLE_DEVICES, I still get an error.

@roytseng-tw
Owner

You should not pass --multi-gpu-testing to the subprocesses. What's the error?

@tunglm2203
Author

Here is the error:
INFO test_engine.py: 330: loading checkpoint Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 626/1250 1.590s + 0.056s (eta: 0:17:06)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 636/1250 0.365s + 0.030s (eta: 0:04:02)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 646/1250 0.330s + 0.028s (eta: 0:03:36)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 656/1250 0.337s + 0.031s (eta: 0:03:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 666/1250 0.333s + 0.029s (eta: 0:03:31)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 676/1250 0.313s + 0.027s (eta: 0:03:15)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 686/1250 0.314s + 0.025s (eta: 0:03:11)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 696/1250 0.305s + 0.024s (eta: 0:03:02)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 706/1250 0.298s + 0.023s (eta: 0:02:54)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 716/1250 0.307s + 0.024s (eta: 0:02:56)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 726/1250 0.305s + 0.024s (eta: 0:02:52)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 736/1250 0.301s + 0.024s (eta: 0:02:46)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 746/1250 0.302s + 0.023s (eta: 0:02:44)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 756/1250 0.298s + 0.023s (eta: 0:02:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 766/1250 0.298s + 0.022s (eta: 0:02:35)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 776/1250 0.296s + 0.022s (eta: 0:02:30)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 786/1250 0.295s + 0.022s (eta: 0:02:27)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 796/1250 0.290s + 0.022s (eta: 0:02:21)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 806/1250 0.293s + 0.023s (eta: 0:02:20)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 816/1250 0.292s + 0.022s (eta: 0:02:16)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 826/1250 0.292s + 0.022s (eta: 0:02:13)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 836/1250 0.293s + 0.022s (eta: 0:02:10)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 846/1250 0.296s + 0.022s (eta: 0:02:08)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 856/1250 0.297s + 0.022s (eta: 0:02:05)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 866/1250 0.296s + 0.022s (eta: 0:02:01)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 876/1250 0.295s + 0.022s (eta: 0:01:58)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 886/1250 0.294s + 0.022s (eta: 0:01:54)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 896/1250 0.292s + 0.021s (eta: 0:01:51)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 906/1250 0.292s + 0.021s (eta: 0:01:47)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 916/1250 0.291s + 0.021s (eta: 0:01:44)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 926/1250 0.292s + 0.022s (eta: 0:01:41)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 936/1250 0.291s + 0.021s (eta: 0:01:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 946/1250 0.289s + 0.021s (eta: 0:01:34)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 956/1250 0.288s + 0.021s (eta: 0:01:30)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 966/1250 0.287s + 0.021s (eta: 0:01:27)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 976/1250 0.287s + 0.022s (eta: 0:01:24)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 986/1250 0.287s + 0.021s (eta: 0:01:21)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 996/1250 0.285s + 0.021s (eta: 0:01:17)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1006/1250 0.287s + 0.021s (eta: 0:01:15)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1016/1250 0.287s + 0.021s (eta: 0:01:12)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1026/1250 0.289s + 0.022s (eta: 0:01:09)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1036/1250 0.289s + 0.022s (eta: 0:01:06)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1046/1250 0.289s + 0.022s (eta: 0:01:03)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1056/1250 0.288s + 0.021s (eta: 0:01:00)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1066/1250 0.288s + 0.021s (eta: 0:00:56)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1076/1250 0.287s + 0.021s (eta: 0:00:53)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1086/1250 0.287s + 0.021s (eta: 0:00:50)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1096/1250 0.287s + 0.021s (eta: 0:00:47)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1106/1250 0.288s + 0.021s (eta: 0:00:44)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1116/1250 0.287s + 0.021s (eta: 0:00:41)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1126/1250 0.287s + 0.021s (eta: 0:00:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1136/1250 0.288s + 0.021s (eta: 0:00:35)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1146/1250 0.289s + 0.021s (eta: 0:00:32)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1156/1250 0.290s + 0.021s (eta: 0:00:29)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1166/1250 0.290s + 0.021s (eta: 0:00:26)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1176/1250 0.289s + 0.021s (eta: 0:00:22)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1186/1250 0.289s + 0.021s (eta: 0:00:19)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1196/1250 0.289s + 0.021s (eta: 0:00:16)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1206/1250 0.289s + 0.021s (eta: 0:00:13)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1216/1250 0.292s + 0.022s (eta: 0:00:10)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1226/1250 0.291s + 0.022s (eta: 0:00:07)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1236/1250 0.291s + 0.022s (eta: 0:00:04)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1246/1250 0.291s + 0.022s (eta: 0:00:01)
INFO test_engine.py: 314: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/test_output/detection_range_625_1250.pkl

INFO test_engine.py: 211: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/test_output/detections.pkl
INFO test_engine.py: 161: Total inference time: 212.193s
INFO task_evaluation.py: 75: Evaluating detections
Traceback (most recent call last):
File "tools/test_net.py", line 118, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 163, in test_net_on_dataset
dataset, all_boxes, all_segms, all_keyps, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 59, in evaluate_all
dataset, all_boxes, output_dir, use_matlab=use_matlab
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 79, in evaluate_boxes
dataset, all_boxes, output_dir, use_salt=not_comp, cleanup=not_comp
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 135, in evaluate_boxes
_write_coco_bbox_results_file(json_dataset, all_boxes, res_file)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 160, in _write_coco_bbox_results_file
json_dataset, all_boxes[cls_ind], cat_id))
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 171, in _coco_bbox_results_one_category
assert len(boxes) == len(image_ids)
AssertionError

To be clear: in the config file e2e_mask_rcnn_R-50-C4_1x.yaml, I set NUM_GPUS=2, and I pass the GPU IDs through CUDA_VISIBLE_DEVICES=5,6.

@roytseng-tw
Owner

roytseng-tw commented May 21, 2018

First, you don't need to change anything in the config file if you use CUDA_VISIBLE_DEVICES to select the available GPUs.
Second, from your log, it looks like you were using 8 GPUs to run the testing (range [626, 1250] of 5000) instead of 2.

My deduction: you are on a machine with 8 GPUs (5000/625 = 8), and CUDA_VISIBLE_DEVICES was not set successfully.

CUDA_VISIBLE_DEVICES is an environment variable checked by the CUDA driver. To set it, you can do either of the following:

  1. export CUDA_VISIBLE_DEVICES=5,6
  2. CUDA_VISIBLE_DEVICES=5,6 python tools/test_net.py ...
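
The 5000/625 deduction above can be checked with a small sketch. It assumes the launcher splits the image indices into contiguous, near-equal ranges, one per GPU; that assumption matches the --range values printed in the logs in this thread but is not taken from the repository's code, and the helper name split_ranges is hypothetical.

```python
def split_ranges(num_images, num_workers):
    """Split [0, num_images) into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(num_images, num_workers)
    ranges = []
    start = 0
    for i in range(num_workers):
        # The first `extra` workers take one additional image each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    for workers in (2, 4, 8):
        print(workers, split_ranges(5000, workers))
```

Under this assumption, a second range of 625 to 1250 only appears when 5000 images are split across 8 workers, which is why the log points to 8 visible GPUs instead of 2.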

@tunglm2203
Author

Yes, I am on a machine with 8 GPUs, but I am only allowed to run on 2 of them, so I want to use only GPUs 5 and 6. I ran it as you said: CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py .... Here is the tail of my log:

INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1246/1250 0.292s + 0.021s (eta: 0:00:01)
INFO test_engine.py: 314: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_625_1250.pkl

INFO test_engine.py: 211: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detections.pkl
INFO test_engine.py: 161: Total inference time: 212.159s
INFO task_evaluation.py: 75: Evaluating detections
Traceback (most recent call last):
File "tools/test_net.py", line 111, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 163, in test_net_on_dataset
dataset, all_boxes, all_segms, all_keyps, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 59, in evaluate_all
dataset, all_boxes, output_dir, use_matlab=use_matlab
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 79, in evaluate_boxes
dataset, all_boxes, output_dir, use_salt=not_comp, cleanup=not_comp
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 135, in evaluate_boxes
_write_coco_bbox_results_file(json_dataset, all_boxes, res_file)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 160, in _write_coco_bbox_results_file
json_dataset, all_boxes[cls_ind], cat_id))
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 171, in _coco_bbox_results_one_category
assert len(boxes) == len(image_ids)
AssertionError

@roytseng-tw
Owner

roytseng-tw commented May 21, 2018

I found something odd in your log, range [626, 1250] of 5000: 1246/1250: the dataset length and the indices do not match! Are you using a clean copy of the code?

@tunglm2203
Author

tunglm2203 commented May 21, 2018

I only added these lines at the top of test_net.py and kept everything else unchanged:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="2,3"

I see that the run is divided into 2 subprocesses, the first with range [1, 2500], but it fails on an assert, as the log below shows.

Here is my new full log:
INFO test_net.py: 70: Called with args:
INFO test_net.py: 71: Namespace(cfg_file='configs/e2e_mask_rcnn_R-50-C4_1x.yaml', dataset='coco2017', load_ckpt='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=True, output_dir=None, range=None, set_cfgs=[], vis=False)
INFO test_net.py: 81: Automatically set output directory to /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test
INFO test_net.py: 102: Testing with config:
INFO test_net.py: 103: {'BBOX_XFORM_CLIP': 4.135166556742356,
'CROP_RESIZE_WITH_MAX_POOL': True,
'CUDA': False,
'DATA_DIR': '/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/data',
'DATA_LOADER': {'NUM_THREADS': 4},
'DEBUG': False,
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXPECTED_RESULTS': [],
'EXPECTED_RESULTS_ATOL': 0.005,
'EXPECTED_RESULTS_EMAIL': '',
'EXPECTED_RESULTS_RTOL': 0.1,
'FAST_RCNN': {'MLP_HEAD_DIM': 1024,
'ROI_BOX_HEAD': 'ResNet.ResNet_roi_conv5_head',
'ROI_XFORM_METHOD': 'RoIAlign',
'ROI_XFORM_RESOLUTION': 14,
'ROI_XFORM_SAMPLING_RATIO': 0},
'FPN': {'COARSEST_STRIDE': 32,
'DIM': 256,
'EXTRA_CONV_LEVELS': False,
'FPN_ON': False,
'MULTILEVEL_ROIS': False,
'MULTILEVEL_RPN': False,
'ROI_CANONICAL_LEVEL': 4,
'ROI_CANONICAL_SCALE': 224,
'ROI_MAX_LEVEL': 5,
'ROI_MIN_LEVEL': 2,
'RPN_ANCHOR_START_SIZE': 32,
'RPN_ASPECT_RATIOS': (0.5, 1, 2),
'RPN_COLLECT_SCALE': 1,
'RPN_MAX_LEVEL': 6,
'RPN_MIN_LEVEL': 2,
'ZERO_INIT_LATERAL': False},
'KRCNN': {'CONV_HEAD_DIM': 256,
'CONV_HEAD_KERNEL': 3,
'CONV_INIT': 'GaussianFill',
'DECONV_DIM': 256,
'DECONV_KERNEL': 4,
'DILATION': 1,
'HEATMAP_SIZE': -1,
'INFERENCE_MIN_SIZE': 0,
'KEYPOINT_CONFIDENCE': 'bbox',
'LOSS_WEIGHT': 1.0,
'MIN_KEYPOINT_COUNT_FOR_VALID_MINIBATCH': 20,
'NMS_OKS': False,
'NORMALIZE_BY_VISIBLE_KEYPOINTS': True,
'NUM_KEYPOINTS': -1,
'NUM_STACKED_CONVS': 8,
'ROI_KEYPOINTS_HEAD': '',
'ROI_XFORM_METHOD': 'RoIAlign',
'ROI_XFORM_RESOLUTION': 7,
'ROI_XFORM_SAMPLING_RATIO': 0,
'UP_SCALE': -1,
'USE_DECONV': False,
'USE_DECONV_OUTPUT': False},
'MATLAB': 'matlab',
'MODEL': {'BBOX_REG_WEIGHTS': (10.0, 10.0, 5.0, 5.0),
'CLS_AGNOSTIC_BBOX_REG': False,
'CONV_BODY': 'ResNet.ResNet50_conv4_body',
'FASTER_RCNN': True,
'KEYPOINTS_ON': False,
'LOAD_IMAGENET_PRETRAINED_WEIGHTS': True,
'MASK_ON': True,
'NUM_CLASSES': 81,
'RPN_ONLY': False,
'SHARE_RES5': True,
'TYPE': 'generalized_rcnn',
'UNSUPERVISED_POSE': False},
'MRCNN': {'CLS_SPECIFIC_MASK': True,
'CONV_INIT': 'MSRAFill',
'DILATION': 1,
'DIM_REDUCED': 256,
'MEMORY_EFFICIENT_LOSS': True,
'RESOLUTION': 14,
'ROI_MASK_HEAD': 'mask_rcnn_heads.mask_rcnn_fcn_head_v0upshare',
'ROI_XFORM_METHOD': 'RoIAlign',
'ROI_XFORM_RESOLUTION': 14,
'ROI_XFORM_SAMPLING_RATIO': 0,
'THRESH_BINARIZE': 0.5,
'UPSAMPLE_RATIO': 1,
'USE_FC_OUTPUT': False,
'WEIGHT_LOSS_MASK': 1.0},
'NUM_GPUS': 8,
'OUTPUT_DIR': 'Outputs',
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'crop',
'POOLING_SIZE': 7,
'PYTORCH_VERSION_LESS_THAN_040': True,
'RESNETS': {'FREEZE_AT': 2,
'IMAGENET_PRETRAINED_WEIGHTS': 'data/pretrained_model/resnet50_caffe.pth',
'NUM_GROUPS': 1,
'RES5_DILATION': 1,
'STRIDE_1X1': True,
'TRANS_FUNC': 'bottleneck_transformation',
'WIDTH_PER_GROUP': 64},
'RETINANET': {'ANCHOR_SCALE': 4,
'ASPECT_RATIOS': (0.5, 1.0, 2.0),
'BBOX_REG_BETA': 0.11,
'BBOX_REG_WEIGHT': 1.0,
'CLASS_SPECIFIC_BBOX': False,
'INFERENCE_TH': 0.05,
'LOSS_ALPHA': 0.25,
'LOSS_GAMMA': 2.0,
'NEGATIVE_OVERLAP': 0.4,
'NUM_CONVS': 4,
'POSITIVE_OVERLAP': 0.5,
'PRE_NMS_TOP_N': 1000,
'PRIOR_PROB': 0.01,
'RETINANET_ON': False,
'SCALES_PER_OCTAVE': 3,
'SHARE_CLS_BBOX_TOWER': False,
'SOFTMAX': False},
'RFCN': {'PS_GRID_SIZE': 3},
'RNG_SEED': 3,
'ROOT_DIR': '/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch',
'RPN': {'ASPECT_RATIOS': (0.5, 1, 2),
'CLS_ACTIVATION': 'sigmoid',
'OUT_DIM': 512,
'OUT_DIM_AS_IN_DIM': True,
'RPN_ON': True,
'SIZES': (32, 64, 128, 256, 512),
'STRIDE': 16},
'SOLVER': {'BASE_LR': 0.01,
'BIAS_DOUBLE_LR': True,
'BIAS_WEIGHT_DECAY': False,
'GAMMA': 0.1,
'LOG_LR_CHANGE_THRESHOLD': 1.1,
'LRS': [],
'LR_POLICY': 'steps_with_decay',
'MAX_ITER': 180000,
'MOMENTUM': 0.9,
'SCALE_MOMENTUM': True,
'SCALE_MOMENTUM_THRESHOLD': 1.1,
'STEPS': [0, 120000, 160000],
'STEP_SIZE': 30000,
'TYPE': 'SGD',
'WARM_UP_FACTOR': 0.3333333333333333,
'WARM_UP_ITERS': 500,
'WARM_UP_METHOD': 'linear',
'WEIGHT_DECAY': 0.0001},
'TEST': {'BBOX_AUG': {'AREA_TH_HI': 32400,
'AREA_TH_LO': 2500,
'ASPECT_RATIOS': (),
'ASPECT_RATIO_H_FLIP': False,
'COORD_HEUR': 'UNION',
'ENABLED': False,
'H_FLIP': False,
'MAX_SIZE': 4000,
'SCALES': (),
'SCALE_H_FLIP': False,
'SCALE_SIZE_DEP': False,
'SCORE_HEUR': 'UNION'},
'BBOX_REG': True,
'BBOX_VOTE': {'ENABLED': False,
'SCORING_METHOD': 'ID',
'SCORING_METHOD_BETA': 1.0,
'VOTE_TH': 0.8},
'COMPETITION_MODE': True,
'DATASETS': ('coco_2017_val',),
'DETECTIONS_PER_IM': 100,
'FORCE_JSON_DATASET_EVAL': False,
'KPS_AUG': {'AREA_TH': 32400,
'ASPECT_RATIOS': (),
'ASPECT_RATIO_H_FLIP': False,
'ENABLED': False,
'HEUR': 'HM_AVG',
'H_FLIP': False,
'MAX_SIZE': 4000,
'SCALES': (),
'SCALE_H_FLIP': False,
'SCALE_SIZE_DEP': False},
'MASK_AUG': {'AREA_TH': 32400,
'ASPECT_RATIOS': (),
'ASPECT_RATIO_H_FLIP': False,
'ENABLED': False,
'HEUR': 'SOFT_AVG',
'H_FLIP': False,
'MAX_SIZE': 4000,
'SCALES': (),
'SCALE_H_FLIP': False,
'SCALE_SIZE_DEP': False},
'MAX_SIZE': 1333,
'NMS': 0.5,
'PRECOMPUTED_PROPOSALS': False,
'PROPOSAL_FILES': (),
'PROPOSAL_LIMIT': 2000,
'RPN_MIN_SIZE': 0,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 1000,
'RPN_PRE_NMS_TOP_N': 6000,
'SCALE': 800,
'SCORE_THRESH': 0.05,
'SOFT_NMS': {'ENABLED': False, 'METHOD': 'linear', 'SIGMA': 0.5}},
'TRAIN': {'ASPECT_CROPPING': False,
'ASPECT_GROUPING': True,
'ASPECT_HI': 2,
'ASPECT_LO': 0.5,
'BATCH_SIZE_PER_IM': 512,
'BBOX_INSIDE_WEIGHTS': (1.0, 1.0, 1.0, 1.0),
'BBOX_NORMALIZE_MEANS': (0.0, 0.0, 0.0, 0.0),
'BBOX_NORMALIZE_STDS': (0.1, 0.1, 0.2, 0.2),
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': False,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'CROWD_FILTER_THRESH': 0.7,
'DATASETS': (),
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'FREEZE_CONV_BODY': False,
'GT_MIN_AREA': -1,
'IMS_PER_BATCH': 1,
'MAX_SIZE': 1333,
'PROPOSAL_FILES': (),
'RPN_BATCH_SIZE_PER_IM': 256,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 0,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'RPN_STRADDLE_THRESH': 0,
'SCALES': (800,),
'SNAPSHOT_ITERS': 20000,
'USE_FLIPPED': True},
'VIS': False,
'VIS_TH': 0.9}
loading annotations into memory...
Done (t=0.72s)
creating index...
index created!
INFO subprocess.py: 87: detection range command 0: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 0 2500 --cfg /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test --load_ckpt /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 87: detection range command 1: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 2500 5000 --cfg /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test --load_ckpt /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 127: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 129: stdout of subprocess 0 with range [1, 2500]
INFO subprocess.py: 131: # ---------------------------------------------------------------------------- #
INFO test_net.py: 70: Called with args:
INFO test_net.py: 71: Namespace(cfg_file='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml', dataset=None, load_ckpt='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=False, output_dir='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test', range=[0, 2500], set_cfgs=['TEST.DATASETS', '("coco_2017_val",)'], vis=False)
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 73, in
assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing)
AssertionError
Traceback (most recent call last):
File "tools/test_net.py", line 114, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 154, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 186, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 107, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 145, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

@roytseng-tw
Owner

You should not add os.environ["CUDA_VISIBLE_DEVICES"]="2,3" to test_net.py
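
One plausible reason, offered as an assumption that this thread does not confirm: if the launcher pins each subprocess to a single GPU through its environment, then a hard-coded assignment at the top of test_net.py overwrites that pin inside every child, and each child sees two GPUs again. The sketch below reproduces only the environment mechanics with plain Python child processes; the helper name visible_devices_in_child and the two child snippets are hypothetical, and no CUDA code runs.

```python
import os
import subprocess
import sys

# Child that trusts the environment it was launched with.
CHILD_PLAIN = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"

# Child that overwrites the variable at the top of the script, the way the
# two added lines in test_net.py do.
CHILD_OVERRIDE = (
    "import os; os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'; "
    "print(os.environ['CUDA_VISIBLE_DEVICES'])"
)


def visible_devices_in_child(child_code, pinned_gpu):
    """Launch a child pinned to one GPU and report what it ends up seeing."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(pinned_gpu)  # assumed launcher behavior
    out = subprocess.check_output([sys.executable, "-c", child_code], env=env)
    return out.decode().strip()


if __name__ == "__main__":
    print(visible_devices_in_child(CHILD_PLAIN, 2))
    print(visible_devices_in_child(CHILD_OVERRIDE, 2))
```

The plain child keeps the single GPU it was given, while the overriding child ends up with both. That would be consistent with the failed device_count() == 1 assertion in the subprocess log above.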

@tunglm2203
Author

tunglm2203 commented May 21, 2018

@roytseng-tw if I don't add this, the command CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py... does not seem to take effect: it still detects 8 GPUs.

@roytseng-tw
Owner

What is the output of this for you?

CUDA_VISIBLE_DEVICES=5,6 python -c "import torch; print(torch.cuda.device_count())"

@tunglm2203
Author

tunglm2203 commented May 21, 2018

@roytseng-tw Output is 2

@tunglm2203
Author

I tried the command:
CUDA_VISIBLE_DEVICES=5,6 python tools/test_net.py ... instead of CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py ...
This seems to help: the run is now divided into 2 subprocesses, the first with range [1, 2500]. I am waiting for it to reach the next range ...

@tunglm2203
Author

@roytseng-tw It ran successfully, thank you. I don't know why the GPU device IDs are not picked up when I use && to chain the commands. Once again, thank you so much!
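
The difference between the two command forms comes from how a POSIX shell treats the assignment. In VAR=value && cmd the assignment is a separate command that sets a shell variable; unless that variable is already exported, it is not placed in the environment of cmd. In VAR=value cmd the assignment applies to the environment of that one command. A minimal sketch, assuming a POSIX sh is available; the helper run_in_shell and the CHILD command are hypothetical stand-ins for python tools/test_net.py.

```python
import os
import subprocess

# Stand-in for the real program: a child process that reports the variable
# as it appears in its own environment, or "unset" when it is missing.
CHILD = "sh -c 'echo \"${CUDA_VISIBLE_DEVICES:-unset}\"'"


def run_in_shell(command_line):
    """Run a command line in sh, starting with CUDA_VISIBLE_DEVICES unset."""
    env = os.environ.copy()
    env.pop("CUDA_VISIBLE_DEVICES", None)
    out = subprocess.check_output(["sh", "-c", command_line], env=env)
    return out.decode().strip()


if __name__ == "__main__":
    # Assignment as its own command: the child does not inherit it.
    print(run_in_shell("CUDA_VISIBLE_DEVICES=5,6 && " + CHILD))
    # Assignment as a prefix of the command: the child inherits it.
    print(run_in_shell("CUDA_VISIBLE_DEVICES=5,6 " + CHILD))
    # Exported first: every later command inherits it.
    print(run_in_shell("export CUDA_VISIBLE_DEVICES=5,6 && " + CHILD))
```

Only the first form leaves the child without the variable, which matches the 8 GPUs detected earlier in this thread.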
