fasterrcnn_resnet50_fpn Linux GPU tests failing on CUDA 11.6 #6655

@atalman

🐛 Describe the bug

Similar to: #6589

After CUDA 10.2 was removed and CUDA 11.6 became the default, the fasterrcnn_resnet50_fpn tests started failing on Python 3.8:

Traceback (most recent call last):
  File "/home/circleci/project/test/test_models.py", line 777, in check_out
    _assert_expected(output, model_name, prec=prec)
  File "/home/circleci/project/test/test_models.py", line 117, in _assert_expected
    torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
  File "/home/circleci/project/env/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
    assert_equal(
  File "/home/circleci/project/env/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 59 / 80 (73.8%)
Greatest absolute difference: 223.815927028656 at index (12, 2) (up to 0.012 allowed)
Greatest relative difference: inf at index (3, 1) (up to 0.012 allowed)

The failure occurred for item [0]['boxes']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/circleci/project/test/test_models.py", line 805, in test_detection_model
    full_validation &= check_out(out, autocast_custom_prec.get(model_name, 0.01))
  File "/home/circleci/project/test/test_models.py", line 785, in check_out
    torch.testing.assert_close(
  File "/home/circleci/project/env/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
    assert_equal(
  File "/home/circleci/project/env/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 20 (10.0%)
Greatest absolute difference: 0.02667921781539917 at index (16,) (up to 0.012 allowed)
Greatest relative difference: 0.03139688501900038 at index (18,) (up to 0.012 allowed)

This failure occurs on Linux
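
For readers interpreting the numbers in the two errors above: torch.testing.assert_close documents its closeness criterion as |actual - expected| <= atol + rtol * |expected|, and a relative difference of inf appears when a mismatched element has an expected value of exactly 0. The following is a minimal pure-Python sketch of that criterion, not the library's implementation. The helper names and the sample values are hypothetical; only the 0.012 tolerances are taken from the log.

```python
import math


def is_close(actual, expected, rtol, atol):
    """Scalar version of the documented criterion:
    |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def relative_difference(actual, expected):
    """Absolute difference divided by |expected|.

    Returns inf when the values differ and expected is 0,
    which is how an 'inf' relative difference can show up in a report.
    """
    abs_diff = abs(actual - expected)
    if abs_diff == 0:
        return 0.0
    if expected == 0:
        return math.inf
    return abs_diff / abs(expected)


# Tolerances as reported in the log ("up to 0.012 allowed").
RTOL = ATOL = 0.012

# Hypothetical values, chosen only for illustration.
# A score that drifts slightly beyond the allowed band is rejected.
print(is_close(0.8767, 0.85, RTOL, ATOL))
# A box coordinate whose expected value is 0 yields an infinite relative difference.
print(relative_difference(5.0, 0.0))
```

Under this reading, the second error (2 of 20 elements, differences around 0.03) looks like a small numerical drift, while the first (59 of 80 box elements, absolute difference above 200) suggests the predicted boxes themselves differ from the expected ones rather than a tolerance that is merely too tight.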

Versions

PR: #6649

cc @Datum @malfet @ptrblck

Versions

latest nightly
