I was trying to reproduce the COCO results for 4 GPUs. We were able to run things on 2 P100s but when we switched to 4 V100s, we got this:
2021-03-17 12:21:19,650 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:19,653 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,132 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,133 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,916 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,920 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,335 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,336 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,785 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,787 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,481 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,483 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,868 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,869 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,195 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,196 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:17,611 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,966 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,968 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,389 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,390 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,071 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,072 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,507 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,508 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,329 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,330 | x2num.py | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,332 | callback.py | line 40 : Batch [20] Speed: 0.83 samples/sec Train-rpn_cls_loss=283.769145, rpn_bbox_loss=nan, rcnn_accuracy=nan, cls_loss=nan, bbox_loss=nan, mask_loss=24.724522, fcn_loss=nan, fcn_roi_loss=1.313301, panoptic_accuracy=0.267114, panoptic_loss=27.402849,
Does anyone know what might be causing it? We're using the normal COCO dataset and the provided upsnet_resnet50_coco_4gpu.yaml config file.
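In case it helps anyone triaging a similar log, here is a minimal sketch for listing which metrics in a `callback.py` line have gone non-finite. The helper name `non_finite_metrics` and the regex are my own assumptions about the `name=value` format shown above; this is not part of UPSNet.

```python
import math
import re

# Assumed format: comma-separated "name=value" pairs, as in the callback.py line above.
METRIC_RE = re.compile(
    r"([A-Za-z_]\w*)=([-+]?(?:nan|inf|\d+(?:\.\d+)?(?:[eE][-+]?\d+)?))",
    re.IGNORECASE,
)


def non_finite_metrics(log_line):
    """Return the names of metrics whose value is NaN or +/-Inf (hypothetical helper)."""
    bad = []
    for name, value in METRIC_RE.findall(log_line):
        # float() accepts "nan" and "inf" in any letter case.
        if not math.isfinite(float(value)):
            bad.append(name)
    return bad
```

Run on the batch 20 line above, this would flag `rpn_bbox_loss`, `rcnn_accuracy`, `cls_loss`, `bbox_loss` and `fcn_loss`, while `mask_loss` and the panoptic metrics are still finite.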
An update on this: it runs fine if I switch to a single V100, but on 4 V100s (or even 2 V100s) it breaks as shown above, and I also run into an invalid axis error.
For anyone who runs into this issue in the future: we did find the solution. The issue has to do with the kernel and how the machine boots up. We had to disable IOMMU passthrough for the PCI bus in our grub.cfg. After doing this, we were able to run without the issue. It seems the V100s were affected by a feature of that architecture that doesn't apply to the P100s. With it enabled, the GPUs could not communicate with each other successfully, which is why training still worked on a single V100.
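If you want to check what your own machine booted with before editing grub.cfg, here is a rough sketch that lists the IOMMU-related parameters on the kernel command line. The helper names are hypothetical, and the post above does not say which exact parameter was changed, so the `intel_iommu=on` and `iommu=pt` values in the comments are only examples of real Linux boot parameters, not the confirmed fix.

```python
from pathlib import Path


def iommu_params(cmdline):
    """Return the boot parameters whose key mentions 'iommu' (hypothetical helper)."""
    return [
        token
        for token in cmdline.split()
        # Look only at the key, i.e. the part before the first "=".
        if "iommu" in token.split("=", 1)[0].lower()
    ]


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line the running Linux kernel was booted with."""
    return Path(path).read_text().strip()


# Example with an assumed command line (not taken from the machine in this thread):
# iommu_params("BOOT_IMAGE=/vmlinuz ro quiet intel_iommu=on iommu=pt")
# gives ["intel_iommu=on", "iommu=pt"]
```

On a Linux box, `iommu_params(read_kernel_cmdline())` shows the current settings. An empty list means no IOMMU option was passed explicitly at boot.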