Getting NaN for loss/accuracy values on 4 GPU config file #141

Closed
rlangefe opened this issue Mar 17, 2021 · 2 comments

@rlangefe

I was trying to reproduce the COCO results for the 4-GPU configuration. Training ran fine on 2 P100s, but when we switched to 4 V100s we got this:

2021-03-17 12:21:19,650 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:19,653 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,132 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:23,133 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,916 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:29,920 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,335 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:33,336 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,785 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:21:36,787 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,481 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:00,483 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,868 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:03,869 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,195 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:14,196 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:17,611 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,966 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:20,968 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,389 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:24,390 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,071 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:28,072 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,507 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:31,508 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,329 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,330 | x2num.py   | line 14 : NaN or Inf found in input tensor.
2021-03-17 12:22:39,332 | callback.py | line 40 : Batch [20]	Speed: 0.83 samples/sec	Train-rpn_cls_loss=283.769145,	rpn_bbox_loss=nan,	rcnn_accuracy=nan,	cls_loss=nan,	bbox_loss=nan,	mask_loss=24.724522,	fcn_loss=nan,	fcn_roi_loss=1.313301,	panoptic_accuracy=0.267114,	panoptic_loss=27.402849,

Does anyone know what might be causing this? We're using the standard COCO dataset and the provided `upsnet_resnet50_coco_4gpu.yaml` config file.
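In case it helps with debugging, here's a minimal sketch (not UPSNet code; `check_finite` is a hypothetical helper) of the kind of guard that makes it easy to see which loss term goes non-finite first, and at which iteration, rather than only noticing NaN in the logger output:

```python
# Minimal sketch, not UPSNet code: call this on the loss dict each iteration
# so the first non-finite loss term is reported immediately.
import torch

def check_finite(losses, step):
    """losses: dict mapping loss names to scalar tensors for this iteration."""
    for name, value in losses.items():
        if not torch.isfinite(value).all():
            raise RuntimeError(
                f"step {step}: loss '{name}' is non-finite ({value.item():.4f})"
            )
```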

@rlangefe (Author)

Just an update: it runs fine if I switch to a single V100, but on 4 V100s (or even 2 V100s) it breaks like this, and I also run into an invalid axis error.
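Since it only fails once more than one GPU is involved, one quick thing to try is checking whether data copied between the GPUs arrives intact. This is just an illustrative script, assuming PyTorch with CUDA (`torch.cuda.can_device_access_peer` is available in recent PyTorch versions):

```python
# Illustrative sketch: report peer-to-peer accessibility between GPU pairs and
# verify that a tensor copied GPU-to-GPU comes back unchanged. On a machine
# with broken inter-GPU communication, the copies themselves can be corrupted.
import torch

n = torch.cuda.device_count()
print(f"visible GPUs: {n}")

for i in range(n):
    for j in range(n):
        if i != j:
            print(f"P2P access {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))

x = torch.randn(1_000_000, device="cuda:0")
for d in range(1, n):
    roundtrip = x.to(f"cuda:{d}").to("cuda:0")
    print(f"copy cuda:0 -> cuda:{d} -> cuda:0 intact:", torch.equal(x, roundtrip))
```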

@rlangefe (Author)

For anyone who runs into this issue in the future, we did find the solution. The problem was with the kernel and how the machine boots: we had to disable IOMMU passthrough for the PCI bus in our grub.cfg. After doing that, training ran without the issue. The GPUs were apparently hitting a problem tied to a platform feature that doesn't apply to the P100s; it kept the GPUs from communicating with each other successfully, which is why the single-V100 run worked.
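For concreteness, this is roughly where that change goes on a typical GRUB2 system (a sketch only: the exact IOMMU flag is platform-dependent, so treat the value below as a placeholder for whatever disables IOMMU passthrough on your machine, and on most distros you edit `/etc/default/grub` and regenerate `grub.cfg` rather than editing it directly):

```
# /etc/default/grub -- placeholder flag; e.g. iommu=soft, intel_iommu=off,
# or amd_iommu=off depending on the platform.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=soft"
```

After that you'd regenerate the config (e.g. `sudo update-grub` on Debian/Ubuntu, or `grub2-mkconfig -o /boot/grub2/grub.cfg` on RHEL/CentOS) and reboot.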
