AssertionError: Range subprocess failed (exit code: 1) #63

tunglm2203 · 2018-05-20T12:26:56Z

Hi @roytseng-tw
When I evaluate my training result, I get the error below:

INFO subprocess.py: 129: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 131: stdout of subprocess 0 with range [1, 1250]
INFO subprocess.py: 133: # ---------------------------------------------------------------------------- #
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 4, in
import cv2
ImportError: No module named cv2
Traceback (most recent call last):
File "tools/test_net.py", line 119, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 155, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 187, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 109, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 147, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

I have installed opencv and can import cv2 successfully, so I don't know what causes this problem. I tried the solution in https://github.com/facebookresearch/Detectron/issues/349 but it did not help. In the config file e2e_mask_rcnn_R-50-C4_1x.yaml, I only changed NUM_GPUS and kept everything else as in the original. Can you tell me what the problem is?

The command that I ran:
python3 tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth --multi-gpu-testing --output_dir Output_val

System information

Operating system: Ubuntu 16.04.4 LTS
CUDA version: 8
cuDNN version: 5.1
GPU models (for all devices if they are not all the same): TITAN X (4 GPUS)
python version: 3.5
pytorch version: 0.3.1

roytseng-tw · 2018-05-20T14:48:46Z

https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/subprocess.py#L71
Changing python to python3 on that line may solve your problem.
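
A hedged aside: one way to avoid hard-coding either `python` or `python3` is to launch the children with `sys.executable`, so they use the same interpreter (and the same installed packages, such as cv2) as the parent. The sketch below is not the code in lib/utils/subprocess.py; the helper names `build_range_command` and `child_python_version` are hypothetical and only illustrate the idea.

```python
import subprocess
import sys


def build_range_command(script, start, end):
    """Build a child command that reuses the parent's interpreter."""
    return [sys.executable, script, "--range", str(start), str(end)]


def child_python_version():
    """Ask a child started via sys.executable which major version it runs."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.version_info[0])"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(result.stdout.strip())


cmd = build_range_command("tools/test_net.py", 0, 1250)
print(cmd[1:])  # everything after the interpreter path
print(child_python_version() == sys.version_info[0])
```

With this approach the child cannot silently fall back to a Python 2 interpreter that lacks the parent's packages.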

tunglm2203 · 2018-05-20T15:11:00Z

Thanks @roytseng-tw for the quick reply. I modified the line you linked, and the ImportError: No module named cv2 is fixed. But the subprocess problem still exists.

DEBUG: Run into test_net_data_set()
INFO subprocess.py: 88: detection range command 0: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 0 1250 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 88: detection range command 1: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 1250 2500 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 88: detection range command 2: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 2500 3750 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 88: detection range command 3: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 3750 5000 --cfg Output_val/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir Output_val --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 128: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 130: stdout of subprocess 0 with range [1, 1250]
INFO subprocess.py: 132: # ---------------------------------------------------------------------------- #
INFO test_net.py: 73: Called with args:
INFO test_net.py: 74: Namespace(cfg_file='Output_val/detection_range_config.yaml', dataset=None, load_ckpt='Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=False, output_dir='Output_val', range=[0, 1250], set_cfgs=['TEST.DATASETS', '("coco_2017_val",)'], vis=False)
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 76, in
assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing)
AssertionError
Traceback (most recent call last):
File "tools/test_net.py", line 117, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 155, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 187, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 108, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 146, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

tunglm2203 · 2018-05-20T15:20:00Z

I modified the command in https://github.com/roytseng-tw/Detectron.pytorch/blob/master/lib/utils/subprocess.py#L71 by adding --multi-gpu-testing, but now there is another problem:
INFO test_engine.py: 330: loading checkpoint Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 118, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 125, in result_getter
gpu_id=gpu_id
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 253, in test_net
cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test.py", line 70, in im_detect_all
model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test.py", line 139, in im_detect_bbox
return_dict = model(**inputs)
File "/mnt/hdd/tung/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/nn/parallel/data_parallel.py", line 82, in forward
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/nn/parallel/data_parallel.py", line 82, in
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range
Traceback (most recent call last):
File "tools/test_net.py", line 118, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 154, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 186, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 107, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 145, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

roytseng-tw · 2018-05-20T16:45:03Z

You should check out the inference section in the README.
Specify --multi-gpu-testing if multiple gpus are available.

tunglm2203 · 2018-05-21T01:08:00Z

Thanks @roytseng-tw. Actually, I did pass --multi-gpu-testing in my command:
python3 tools/test_net.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth --multi-gpu-testing --output_dir Output_val

But in https://github.com/roytseng-tw/Detectron.pytorch/blob/1833c71a62e389d2b5f873f40a914c5a47bdd8a2/lib/utils/subprocess.py#L71 , --multi-gpu-testing is not passed to the subprocess. I changed the command passed to the subprocess to include it, but then I get the other problem shown above.

roytseng-tw · 2018-05-21T01:49:51Z

You should not change anything except python --> python3.

tunglm2203 · 2018-05-21T02:08:17Z

Update: @roytseng-tw yes, I kept everything as you said. Evaluating on a single GPU runs successfully, but when I pass --multi-gpu-testing in my command and select the GPU devices through CUDA_VISIBLE_DEVICES, I still get an error.

roytseng-tw · 2018-05-21T02:21:56Z

You should not pass --multi-gpu-testing to the subprocesses. And what's the error?
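
For context, the bare AssertionError in the earlier log comes from the check `assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing)` in tools/test_net.py. A minimal sketch of which combinations that XOR accepts, with the hypothetical helper `launch_mode_ok` standing in for the assert and plain integers standing in for `torch.cuda.device_count()`:

```python
def launch_mode_ok(device_count, multi_gpu_testing):
    """Mirror the XOR check: exactly one of the two conditions must hold."""
    return (device_count == 1) ^ bool(multi_gpu_testing)


# (visible GPUs, --multi-gpu-testing passed?)
cases = [(1, False), (2, True), (2, False), (1, True)]
for device_count, flag in cases:
    print(device_count, flag, launch_mode_ok(device_count, flag))
```

Read this way, a child process started without the flag passes the check only if it sees exactly one GPU, and the parent started with the flag passes only if it sees more than one.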

tunglm2203 · 2018-05-21T02:39:01Z

Here is the error:
INFO test_engine.py: 330: loading checkpoint Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 626/1250 1.590s + 0.056s (eta: 0:17:06)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 636/1250 0.365s + 0.030s (eta: 0:04:02)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 646/1250 0.330s + 0.028s (eta: 0:03:36)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 656/1250 0.337s + 0.031s (eta: 0:03:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 666/1250 0.333s + 0.029s (eta: 0:03:31)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 676/1250 0.313s + 0.027s (eta: 0:03:15)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 686/1250 0.314s + 0.025s (eta: 0:03:11)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 696/1250 0.305s + 0.024s (eta: 0:03:02)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 706/1250 0.298s + 0.023s (eta: 0:02:54)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 716/1250 0.307s + 0.024s (eta: 0:02:56)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 726/1250 0.305s + 0.024s (eta: 0:02:52)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 736/1250 0.301s + 0.024s (eta: 0:02:46)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 746/1250 0.302s + 0.023s (eta: 0:02:44)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 756/1250 0.298s + 0.023s (eta: 0:02:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 766/1250 0.298s + 0.022s (eta: 0:02:35)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 776/1250 0.296s + 0.022s (eta: 0:02:30)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 786/1250 0.295s + 0.022s (eta: 0:02:27)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 796/1250 0.290s + 0.022s (eta: 0:02:21)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 806/1250 0.293s + 0.023s (eta: 0:02:20)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 816/1250 0.292s + 0.022s (eta: 0:02:16)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 826/1250 0.292s + 0.022s (eta: 0:02:13)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 836/1250 0.293s + 0.022s (eta: 0:02:10)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 846/1250 0.296s + 0.022s (eta: 0:02:08)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 856/1250 0.297s + 0.022s (eta: 0:02:05)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 866/1250 0.296s + 0.022s (eta: 0:02:01)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 876/1250 0.295s + 0.022s (eta: 0:01:58)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 886/1250 0.294s + 0.022s (eta: 0:01:54)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 896/1250 0.292s + 0.021s (eta: 0:01:51)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 906/1250 0.292s + 0.021s (eta: 0:01:47)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 916/1250 0.291s + 0.021s (eta: 0:01:44)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 926/1250 0.292s + 0.022s (eta: 0:01:41)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 936/1250 0.291s + 0.021s (eta: 0:01:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 946/1250 0.289s + 0.021s (eta: 0:01:34)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 956/1250 0.288s + 0.021s (eta: 0:01:30)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 966/1250 0.287s + 0.021s (eta: 0:01:27)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 976/1250 0.287s + 0.022s (eta: 0:01:24)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 986/1250 0.287s + 0.021s (eta: 0:01:21)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 996/1250 0.285s + 0.021s (eta: 0:01:17)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1006/1250 0.287s + 0.021s (eta: 0:01:15)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1016/1250 0.287s + 0.021s (eta: 0:01:12)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1026/1250 0.289s + 0.022s (eta: 0:01:09)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1036/1250 0.289s + 0.022s (eta: 0:01:06)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1046/1250 0.289s + 0.022s (eta: 0:01:03)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1056/1250 0.288s + 0.021s (eta: 0:01:00)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1066/1250 0.288s + 0.021s (eta: 0:00:56)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1076/1250 0.287s + 0.021s (eta: 0:00:53)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1086/1250 0.287s + 0.021s (eta: 0:00:50)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1096/1250 0.287s + 0.021s (eta: 0:00:47)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1106/1250 0.288s + 0.021s (eta: 0:00:44)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1116/1250 0.287s + 0.021s (eta: 0:00:41)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1126/1250 0.287s + 0.021s (eta: 0:00:38)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1136/1250 0.288s + 0.021s (eta: 0:00:35)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1146/1250 0.289s + 0.021s (eta: 0:00:32)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1156/1250 0.290s + 0.021s (eta: 0:00:29)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1166/1250 0.290s + 0.021s (eta: 0:00:26)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1176/1250 0.289s + 0.021s (eta: 0:00:22)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1186/1250 0.289s + 0.021s (eta: 0:00:19)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1196/1250 0.289s + 0.021s (eta: 0:00:16)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1206/1250 0.289s + 0.021s (eta: 0:00:13)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1216/1250 0.292s + 0.022s (eta: 0:00:10)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1226/1250 0.291s + 0.022s (eta: 0:00:07)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1236/1250 0.291s + 0.022s (eta: 0:00:04)
INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1246/1250 0.291s + 0.022s (eta: 0:00:01)
INFO test_engine.py: 314: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/test_output/detection_range_625_1250.pkl

INFO test_engine.py: 211: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/test_output/detections.pkl
INFO test_engine.py: 161: Total inference time: 212.193s
INFO task_evaluation.py: 75: Evaluating detections
Traceback (most recent call last):
File "tools/test_net.py", line 118, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 163, in test_net_on_dataset
dataset, all_boxes, all_segms, all_keyps, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 59, in evaluate_all
dataset, all_boxes, output_dir, use_matlab=use_matlab
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 79, in evaluate_boxes
dataset, all_boxes, output_dir, use_salt=not_comp, cleanup=not_comp
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 135, in evaluate_boxes
_write_coco_bbox_results_file(json_dataset, all_boxes, res_file)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 160, in _write_coco_bbox_results_file
json_dataset, all_boxes[cls_ind], cat_id))
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 171, in _coco_bbox_results_one_category
assert len(boxes) == len(image_ids)
AssertionError

To be clear: in the config file e2e_mask_rcnn_R-50-C4_1x.yaml I set NUM_GPUS=2, and I pass the GPU ids through CUDA_VISIBLE_DEVICES=5,6.

roytseng-tw · 2018-05-21T03:56:23Z

First, you don't need to change anything in the config file if you use CUDA_VISIBLE_DEVICES to set the available GPUs.
Second, from your log, it looks like you were using 8 GPUs to run the testing (range [626, 1250] of 5000) instead of 2.

Below is my deduction:
You are on a machine with 8 GPUs (5000/625 = 8), and you didn't successfully set CUDA_VISIBLE_DEVICES.
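
The arithmetic behind that deduction can be sketched as follows. This is an illustrative even split, not the repository's own code; `split_ranges` is a hypothetical helper. Note that the log appears to print the start index 1-based (the command `--range 0 1250` shows up as `range [1, 1250]`), so `[626, 1250]` corresponds to the half-open range 625..1250.

```python
def split_ranges(num_images, num_procs):
    """Split [0, num_images) into num_procs contiguous half-open ranges."""
    base, extra = divmod(num_images, num_procs)
    ranges = []
    start = 0
    for i in range(num_procs):
        # The first `extra` ranges take one additional image each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


print(split_ranges(5000, 8)[1])  # second of 8 ranges
print(split_ranges(5000, 4)[0])  # first of 4 ranges
print(split_ranges(5000, 2))     # what 2 visible GPUs would give
```

A second range of (625, 1250) only appears when the 5000 images are divided 8 ways, which is why the log points to 8 visible GPUs rather than 2.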

CUDA_VISIBLE_DEVICES is an environment variable checked by the CUDA driver. To set it, you can either export it first, or prefix the command with it:

export CUDA_VISIBLE_DEVICES=5,6
CUDA_VISIBLE_DEVICES=5,6 python tools/test_net.py ...

tunglm2203 · 2018-05-21T04:44:31Z

Yes, I am on a machine with 8 GPUs, but I am only allowed to use 2 of them, so I want to run on GPUs 5 and 6 only. I ran it as you said: CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py .... Here is the tail of my log:

INFO test_engine.py: 281: im_detect: range [626, 1250] of 5000: 1246/1250 0.292s + 0.021s (eta: 0:00:01)
INFO test_engine.py: 314: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_625_1250.pkl

INFO test_engine.py: 211: Wrote detections to: /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detections.pkl
INFO test_engine.py: 161: Total inference time: 212.159s
INFO task_evaluation.py: 75: Evaluating detections
Traceback (most recent call last):
File "tools/test_net.py", line 111, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 163, in test_net_on_dataset
dataset, all_boxes, all_segms, all_keyps, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 59, in evaluate_all
dataset, all_boxes, output_dir, use_matlab=use_matlab
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/task_evaluation.py", line 79, in evaluate_boxes
dataset, all_boxes, output_dir, use_salt=not_comp, cleanup=not_comp
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 135, in evaluate_boxes
_write_coco_bbox_results_file(json_dataset, all_boxes, res_file)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 160, in _write_coco_bbox_results_file
json_dataset, all_boxes[cls_ind], cat_id))
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/datasets/json_dataset_evaluator.py", line 171, in _coco_bbox_results_one_category
assert len(boxes) == len(image_ids)
AssertionError

roytseng-tw · 2018-05-21T06:14:21Z

I found something weird in your log, range [626, 1250] of 5000: 1246/1250: the length of the dataset and the indices do not match! Are you using clean code?

tunglm2203 · 2018-05-21T10:13:58Z

I only added these lines at the top of test_net.py and kept everything else unchanged:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="2,3"

I see that the process is divided into 2 subprocesses and the first range is [1, 2500], but it fails on the assert shown in the log below.

Here is my new full log:
INFO test_net.py: 70: Called with args:
INFO test_net.py: 71: Namespace(cfg_file='configs/e2e_mask_rcnn_R-50-C4_1x.yaml', dataset='coco2017', load_ckpt='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=True, output_dir=None, range=None, set_cfgs=[], vis=False)
INFO test_net.py: 81: Automatically set output directory to /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test
INFO test_net.py: 102: Testing with config:
INFO test_net.py: 103: {'BBOX_XFORM_CLIP': 4.135166556742356,
'CROP_RESIZE_WITH_MAX_POOL': True,
'CUDA': False,
'DATA_DIR': '/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/data',
'DATA_LOADER': {'NUM_THREADS': 4},
'DEBUG': False,
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXPECTED_RESULTS': [],
'EXPECTED_RESULTS_ATOL': 0.005,
'EXPECTED_RESULTS_EMAIL': '',
'EXPECTED_RESULTS_RTOL': 0.1,
'FAST_RCNN': {'MLP_HEAD_DIM': 1024,
'ROI_BOX_HEAD': 'ResNet.ResNet_roi_conv5_head',
'ROI_XFORM_METHOD': 'RoIAlign',
'ROI_XFORM_RESOLUTION': 14,
'ROI_XFORM_SAMPLING_RATIO': 0},
'FPN': {'COARSEST_STRIDE': 32,
'DIM': 256,
'EXTRA_CONV_LEVELS': False,
'FPN_ON': False,
'MULTILEVEL_ROIS': False,
'MULTILEVEL_RPN': False,
'ROI_CANONICAL_LEVEL': 4,
'ROI_CANONICAL_SCALE': 224,
'ROI_MAX_LEVEL': 5,
'ROI_MIN_LEVEL': 2,
'RPN_ANCHOR_START_SIZE': 32,
'RPN_ASPECT_RATIOS': (0.5, 1, 2),
'RPN_COLLECT_SCALE': 1,
'RPN_MAX_LEVEL': 6,
'RPN_MIN_LEVEL': 2,
'ZERO_INIT_LATERAL': False},
'KRCNN': {'CONV_HEAD_DIM': 256,
'CONV_HEAD_KERNEL': 3,
'CONV_INIT': 'GaussianFill',
'DECONV_DIM': 256,
'DECONV_KERNEL': 4,
'DILATION': 1,
'HEATMAP_SIZE': -1,
'INFERENCE_MIN_SIZE': 0,
'KEYPOINT_CONFIDENCE': 'bbox',
'LOSS_WEIGHT': 1.0,
'MIN_KEYPOINT_COUNT_FOR_VALID_MINIBATCH': 20,
'NMS_OKS': False,
'NORMALIZE_BY_VISIBLE_KEYPOINTS': True,
'NUM_KEYPOINTS': -1,
'NUM_STACKED_CONVS': 8,
'ROI_KEYPOINTS_HEAD': '',
'ROI_XFORM_METHOD': 'RoIAlign',
'ROI_XFORM_RESOLUTION': 7,
'ROI_XFORM_SAMPLING_RATIO': 0,
'UP_SCALE': -1,
'USE_DECONV': False,
'USE_DECONV_OUTPUT': False},
'MATLAB': 'matlab',
'MODEL': {'BBOX_REG_WEIGHTS': (10.0, 10.0, 5.0, 5.0),
'CLS_AGNOSTIC_BBOX_REG': False,
'CONV_BODY': 'ResNet.ResNet50_conv4_body',
'FASTER_RCNN': True,
'KEYPOINTS_ON': False,
'LOAD_IMAGENET_PRETRAINED_WEIGHTS': True,
'MASK_ON': True,
'NUM_CLASSES': 81,
'RPN_ONLY': False,
'SHARE_RES5': True,
'TYPE': 'generalized_rcnn',
'UNSUPERVISED_POSE': False},
'MRCNN': {'CLS_SPECIFIC_MASK': True,
'CONV_INIT': 'MSRAFill',
'DILATION': 1,
'DIM_REDUCED': 256,
'MEMORY_EFFICIENT_LOSS': True,
'RESOLUTION': 14,
'ROI_MASK_HEAD': 'mask_rcnn_heads.mask_rcnn_fcn_head_v0upshare',
'ROI_XFORM_METHOD': 'RoIAlign',
'ROI_XFORM_RESOLUTION': 14,
'ROI_XFORM_SAMPLING_RATIO': 0,
'THRESH_BINARIZE': 0.5,
'UPSAMPLE_RATIO': 1,
'USE_FC_OUTPUT': False,
'WEIGHT_LOSS_MASK': 1.0},
'NUM_GPUS': 8,
'OUTPUT_DIR': 'Outputs',
'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'crop',
'POOLING_SIZE': 7,
'PYTORCH_VERSION_LESS_THAN_040': True,
'RESNETS': {'FREEZE_AT': 2,
'IMAGENET_PRETRAINED_WEIGHTS': 'data/pretrained_model/resnet50_caffe.pth',
'NUM_GROUPS': 1,
'RES5_DILATION': 1,
'STRIDE_1X1': True,
'TRANS_FUNC': 'bottleneck_transformation',
'WIDTH_PER_GROUP': 64},
'RETINANET': {'ANCHOR_SCALE': 4,
'ASPECT_RATIOS': (0.5, 1.0, 2.0),
'BBOX_REG_BETA': 0.11,
'BBOX_REG_WEIGHT': 1.0,
'CLASS_SPECIFIC_BBOX': False,
'INFERENCE_TH': 0.05,
'LOSS_ALPHA': 0.25,
'LOSS_GAMMA': 2.0,
'NEGATIVE_OVERLAP': 0.4,
'NUM_CONVS': 4,
'POSITIVE_OVERLAP': 0.5,
'PRE_NMS_TOP_N': 1000,
'PRIOR_PROB': 0.01,
'RETINANET_ON': False,
'SCALES_PER_OCTAVE': 3,
'SHARE_CLS_BBOX_TOWER': False,
'SOFTMAX': False},
'RFCN': {'PS_GRID_SIZE': 3},
'RNG_SEED': 3,
'ROOT_DIR': '/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch',
'RPN': {'ASPECT_RATIOS': (0.5, 1, 2),
'CLS_ACTIVATION': 'sigmoid',
'OUT_DIM': 512,
'OUT_DIM_AS_IN_DIM': True,
'RPN_ON': True,
'SIZES': (32, 64, 128, 256, 512),
'STRIDE': 16},
'SOLVER': {'BASE_LR': 0.01,
'BIAS_DOUBLE_LR': True,
'BIAS_WEIGHT_DECAY': False,
'GAMMA': 0.1,
'LOG_LR_CHANGE_THRESHOLD': 1.1,
'LRS': [],
'LR_POLICY': 'steps_with_decay',
'MAX_ITER': 180000,
'MOMENTUM': 0.9,
'SCALE_MOMENTUM': True,
'SCALE_MOMENTUM_THRESHOLD': 1.1,
'STEPS': [0, 120000, 160000],
'STEP_SIZE': 30000,
'TYPE': 'SGD',
'WARM_UP_FACTOR': 0.3333333333333333,
'WARM_UP_ITERS': 500,
'WARM_UP_METHOD': 'linear',
'WEIGHT_DECAY': 0.0001},
'TEST': {'BBOX_AUG': {'AREA_TH_HI': 32400,
'AREA_TH_LO': 2500,
'ASPECT_RATIOS': (),
'ASPECT_RATIO_H_FLIP': False,
'COORD_HEUR': 'UNION',
'ENABLED': False,
'H_FLIP': False,
'MAX_SIZE': 4000,
'SCALES': (),
'SCALE_H_FLIP': False,
'SCALE_SIZE_DEP': False,
'SCORE_HEUR': 'UNION'},
'BBOX_REG': True,
'BBOX_VOTE': {'ENABLED': False,
'SCORING_METHOD': 'ID',
'SCORING_METHOD_BETA': 1.0,
'VOTE_TH': 0.8},
'COMPETITION_MODE': True,
'DATASETS': ('coco_2017_val',),
'DETECTIONS_PER_IM': 100,
'FORCE_JSON_DATASET_EVAL': False,
'KPS_AUG': {'AREA_TH': 32400,
'ASPECT_RATIOS': (),
'ASPECT_RATIO_H_FLIP': False,
'ENABLED': False,
'HEUR': 'HM_AVG',
'H_FLIP': False,
'MAX_SIZE': 4000,
'SCALES': (),
'SCALE_H_FLIP': False,
'SCALE_SIZE_DEP': False},
'MASK_AUG': {'AREA_TH': 32400,
'ASPECT_RATIOS': (),
'ASPECT_RATIO_H_FLIP': False,
'ENABLED': False,
'HEUR': 'SOFT_AVG',
'H_FLIP': False,
'MAX_SIZE': 4000,
'SCALES': (),
'SCALE_H_FLIP': False,
'SCALE_SIZE_DEP': False},
'MAX_SIZE': 1333,
'NMS': 0.5,
'PRECOMPUTED_PROPOSALS': False,
'PROPOSAL_FILES': (),
'PROPOSAL_LIMIT': 2000,
'RPN_MIN_SIZE': 0,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 1000,
'RPN_PRE_NMS_TOP_N': 6000,
'SCALE': 800,
'SCORE_THRESH': 0.05,
'SOFT_NMS': {'ENABLED': False, 'METHOD': 'linear', 'SIGMA': 0.5}},
'TRAIN': {'ASPECT_CROPPING': False,
'ASPECT_GROUPING': True,
'ASPECT_HI': 2,
'ASPECT_LO': 0.5,
'BATCH_SIZE_PER_IM': 512,
'BBOX_INSIDE_WEIGHTS': (1.0, 1.0, 1.0, 1.0),
'BBOX_NORMALIZE_MEANS': (0.0, 0.0, 0.0, 0.0),
'BBOX_NORMALIZE_STDS': (0.1, 0.1, 0.2, 0.2),
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': False,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'CROWD_FILTER_THRESH': 0.7,
'DATASETS': (),
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'FREEZE_CONV_BODY': False,
'GT_MIN_AREA': -1,
'IMS_PER_BATCH': 1,
'MAX_SIZE': 1333,
'PROPOSAL_FILES': (),
'RPN_BATCH_SIZE_PER_IM': 256,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 0,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'RPN_STRADDLE_THRESH': 0,
'SCALES': (800,),
'SNAPSHOT_ITERS': 20000,
'USE_FLIPPED': True},
'VIS': False,
'VIS_TH': 0.9}
loading annotations into memory...
Done (t=0.72s)
creating index...
index created!
INFO subprocess.py: 87: detection range command 0: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 0 2500 --cfg /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test --load_ckpt /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 87: detection range command 1: python3 /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py --range 2500 5000 --cfg /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml --set TEST.DATASETS '("coco_2017_val",)' --output_dir /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test --load_ckpt /mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth
INFO subprocess.py: 127: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 129: stdout of subprocess 0 with range [1, 2500]
INFO subprocess.py: 131: # ---------------------------------------------------------------------------- #
INFO test_net.py: 70: Called with args:
INFO test_net.py: 71: Namespace(cfg_file='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test/detection_range_config.yaml', dataset=None, load_ckpt='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/ckpt/model_step89999.pth', load_detectron=None, multi_gpu_testing=False, output_dir='/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/Outputs/e2e_mask_rcnn_R-50-C4_1x/May17-21-45-19_slspGPU6_step/test', range=[0, 2500], set_cfgs=['TEST.DATASETS', '("coco_2017_val",)'], vis=False)
Traceback (most recent call last):
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/tools/test_net.py", line 73, in
assert (torch.cuda.device_count() == 1) ^ bool(args.multi_gpu_testing)
AssertionError
Traceback (most recent call last):
File "tools/test_net.py", line 114, in
check_expected_results=True)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 154, in test_net_on_dataset
args, dataset_name, proposal_file, num_images, output_dir
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/core/test_engine.py", line 186, in multi_gpu_test_net_on_dataset
args.load_ckpt, args.load_detectron, opts
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 107, in process_in_parallel
log_subprocess_output(i, p, output_dir, tag, start, end)
File "/mnt/hdd/tung/aim_2018/try_model/mask-rcnn.pytorch/lib/utils/subprocess.py", line 145, in log_subprocess_output
assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 1)

roytseng-tw · 2018-05-21T10:19:32Z

You should not add os.environ["CUDA_VISIBLE_DEVICES"]="2,3" to test_net.py
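
For readers following along, the range subprocess in the log above fails on the check at line 73 of test_net.py. Below is a minimal sketch of that check, with torch.cuda.device_count() replaced by a plain integer so that it runs without a GPU. The helper name launch_is_valid is hypothetical and not part of the repository. The log shows the subprocess running with multi_gpu_testing=False, so the check only passes if that subprocess sees exactly one device. A hard-coded two-GPU setting inside test_net.py would plausibly make each subprocess see two devices and trip the assertion. That reading assumes the parent process assigns a single GPU to each range subprocess.

```python
def launch_is_valid(visible_gpus, multi_gpu_testing):
    """Mirror of the XOR check from the traceback (helper name is hypothetical).

    visible_gpus stands in for torch.cuda.device_count().
    """
    return (visible_gpus == 1) ^ bool(multi_gpu_testing)


# Enumerate the four combinations of "one GPU visible?" and "multi-GPU flag set?".
cases = [(1, False), (2, False), (2, True), (1, True)]
results = {case: launch_is_valid(*case) for case in cases}

for (visible_gpus, multi_gpu_testing), ok in results.items():
    print("visible_gpus={} multi_gpu_testing={} -> passes check: {}".format(
        visible_gpus, multi_gpu_testing, ok))
```

The case (2, False) matches the failing subprocess in the log: two visible devices with multi_gpu_testing=False.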

tunglm2203 · 2018-05-21T10:22:10Z

@roytseng-tw If I don't add this, the command CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py... does not work as intended: it still detects 8 GPUs.

roytseng-tw · 2018-05-21T10:42:18Z

What is the output of this for you?

CUDA_VISIBLE_DEVICES=5,6 python -c "import torch; print(torch.cuda.device_count())"

tunglm2203 · 2018-05-21T11:06:40Z

@roytseng-tw The output is 2.

tunglm2203 · 2018-05-21T11:12:32Z

I tried the command:
CUDA_VISIBLE_DEVICES=5,6 python tools/test_net.py ... instead of CUDA_VISIBLE_DEVICES=5,6 && python tools/test_net.py ...
That seems to help: the run is now split into 2 subprocesses, the first with range [1, 2500]. I will wait until it reaches the next range ...

tunglm2203 · 2018-05-21T11:27:54Z

@roytseng-tw It ran successfully, thank you. I don't know why the GPU device IDs are not picked up when I use && to chain the commands. Once again, thank you so much!
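
A likely explanation for the && behaviour, offered as a sketch rather than a confirmed diagnosis: in a POSIX shell, VAR=value command places VAR in the environment of that one command, whereas VAR=value && command only sets an unexported shell variable and then runs the command without it. The snippet below demonstrates the difference using a hypothetical variable name (DEMO_VISIBLE_DEVICES) so that it does not depend on CUDA or torch. It assumes a POSIX sh is available. The helper name child_sees is illustrative only.

```python
import os
import subprocess
import sys

# Hypothetical variable name, chosen so the demo does not touch real CUDA settings.
VAR = "DEMO_VISIBLE_DEVICES"


def child_sees(shell_command):
    """Run shell_command under sh and return what the child Python prints."""
    # Start from an environment that does not already contain VAR.
    env = {k: v for k, v in os.environ.items() if k != VAR}
    result = subprocess.run(
        ["sh", "-c", shell_command],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# Child command: a Python process that reports VAR from its own environment.
child = '"{}" -c "import os; print(os.environ.get(\'{}\'))"'.format(sys.executable, VAR)

# Prefix form: the assignment is part of the command's environment.
prefix_result = child_sees("{}=5,6 {}".format(VAR, child))

# && form: the assignment is a separate, unexported shell variable.
and_result = child_sees("{}=5,6 && {}".format(VAR, child))

print("prefix form, child sees:", prefix_result)
print("&& form, child sees:", and_result)
```

Under these assumptions the prefix form passes the value to the child and the && form does not. Writing export CUDA_VISIBLE_DEVICES=5,6 && python ... would also pass it, because export marks the variable for inheritance by child processes.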

tunglm2203 closed this as completed May 21, 2018

tunglm2203 reopened this May 21, 2018

tunglm2203 closed this as completed May 21, 2018
