Getting NaN for loss/accuracy values on 4 GPU config file #141

Closed
rlangefe opened this issue Mar 17, 2021 · 2 comments

@rlangefe

I was trying to reproduce the COCO results for the 4-GPU configuration. Training ran fine on 2 P100s, but when we switched to 4 V100s we got this:

2021-03-17 12:21:19,650 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:19,653 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,132 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,133 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,916 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,920 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,335 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,336 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,785 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,787 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,481 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,483 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,868 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,869 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,195 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,196 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:17,611 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,966 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,968 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,389 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,390 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,071 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,072 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,507 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,508 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,329 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,330 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,332 | callback.py | line 40 : Batch [20]	Speed: 0.83 samples/sec	Train-rpn_cls_loss=283.769145,	rpn_bbox_loss=nan,	rcnn_accuracy=nan,	cls_loss=nan,	bbox_loss=nan,	mask_loss=24.724522,	fcn_loss=nan,	fcn_roi_loss=1.313301,	panoptic_accuracy=0.267114,	panoptic_loss=27.402849,

Does anyone know what might be causing this? We're using the standard COCO dataset and the provided `upsnet_resnet50_coco_4gpu.yaml` config file.
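In case it helps with debugging, here's a minimal sketch (not UPSNet code; `check_finite` is a hypothetical helper) of the kind of guard that makes it easy to see which loss term goes non-finite first, and at which iteration, rather than only noticing NaN in the logger output:

```python
# Minimal sketch, not UPSNet code: call this on the loss dict each iteration
# so the first non-finite loss term is reported immediately.
import torch

def check_finite(losses, step):
    """losses: dict mapping loss names to scalar tensors for this iteration."""
    for name, value in losses.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(
                f"step {step}: loss '{name}' is non-finite ({value.item():.4f})"
            )
```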

@rlangefe (Author)

Just an update: it runs fine if I switch to a single V100, but on 4 V100s (or even 2 V100s) it breaks like this, and I also run into an invalid axis error.
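Since it only fails once more than one GPU is involved, one quick thing to try is checking whether data copied between the GPUs arrives intact. This is just an illustrative script, assuming PyTorch with CUDA (`torch.cuda.can_device_access_peer` is available in recent PyTorch versions):

```python
# Illustrative sketch: report peer-to-peer accessibility between GPU pairs and
# verify that a tensor copied GPU-to-GPU comes back unchanged. On a machine
# with broken inter-GPU communication, the copies themselves can be corrupted.
import torch

n = torch.cuda.device_count()
print(f"visible GPUs: {n}")

for i in range(n):
    for j in range(n):
        if i != j:
            print(f"P2P access {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))

x = torch.randn(1_000_000, device="cuda:0")
for d in range(1, n):
    roundtrip = x.to(f"cuda:{d}").to("cuda:0")
    print(f"copy cuda:0 -> cuda:{d} -> cuda:0 intact:", torch.equal(x, roundtrip))
```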

@rlangefe (Author)

For anyone who runs into this issue in the future, we did find the solution. The problem was with the kernel and how the machine boots: we had to disable IOMMU passthrough for the PCI bus in our grub.cfg. After doing that, training ran without the issue. The GPUs were apparently hitting a problem tied to a platform feature that doesn't apply to the P100s; it kept the GPUs from communicating with each other successfully, which is why the single-V100 run worked.
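For concreteness, this is roughly where that change goes on a typical GRUB2 system (a sketch only: the exact IOMMU flag is platform-dependent, so treat the value below as a placeholder for whatever disables IOMMU passthrough on your machine, and on most distros you edit `/etc/default/grub` and regenerate `grub.cfg` rather than editing it directly):

```
# /etc/default/grub -- placeholder flag; e.g. iommu=soft, intel_iommu=off,
# or amd_iommu=off depending on the platform.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=soft"
```

After that you'd regenerate the config (e.g. `sudo update-grub` on Debian/Ubuntu, or `grub2-mkconfig -o /boot/grub2/grub.cfg` on RHEL/CentOS) and reboot.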
