
Conversation

@driazati (Contributor) commented Oct 8, 2019

This copies some logic from test_jit.py to check that a TorchScript'ed
model's outputs are the same as outputs from the model in eager mode.

To support differences in TorchScript / eager mode outputs, an
unwrapper function can be provided per-model.

These tests take a while (e.g., inception_v3 goes from 4 seconds to 3 minutes), so PYTORCH_TEST_WITH_SLOW=1 must be set for the TorchScript tests to run (there are CI changes to enable this flag). With the module caching changes coming after 1.3, this time should be dramatically reduced.

This is ready to go but needs #1407 to land first
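The check described above can be sketched without torch. The names here (`check_outputs`, the lambda stand-ins) are illustrative, not the PR's actual API; the real `checkModule` scripts the module with `torch.jit.script` and compares against eager outputs, with an optional per-model `unwrapper` to adapt output shapes:

```python
import math

def check_outputs(eager_fn, scripted_fn, args, unwrapper=None, tol=1e-5):
    """Run the same inputs through the eager and scripted callables and
    assert the outputs match. `unwrapper` adapts the scripted output
    (e.g. a scripted model returning a tuple where eager returns a value)."""
    expected = eager_fn(*args)
    actual = scripted_fn(*args)
    if unwrapper is not None:
        actual = unwrapper(actual)
    assert math.isclose(actual, expected, rel_tol=tol), (actual, expected)
    return actual

# Stand-ins: an "eager model" and a "scripted" form that wraps its output in a tuple.
eager = lambda x: x * 2.0
scripted = lambda x: (x * 2.0,)
result = check_outputs(eager, scripted, (3.0,), unwrapper=lambda out: out[0])
```

The per-model `unwrapper` is the key hook: it lets a single comparison harness handle models whose TorchScript output structure differs from eager mode.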

        else:
            self.assertNestedTensorObjectsEqual(output, expected, rtol=rtol, atol=atol)

    def assertEqual(self, x, y, prec=None, message='', allow_inf=False):
Contributor Author (@driazati):
This is all copied from PyTorch's common_utils

Contributor:

This is maybe more general than it needs to be. Would assertNestedTensorObjectsEqual suffice? This function does, for example, coercion between numbers and tensor results that you wouldn't want to allow when testing model equality. It's also a good amount of code.

Contributor Author:

assertEqual has been pretty stable and widely used in PyTorch. I think it's better to copy the entire thing than to have to move it over piece by piece later on if some functionality turns out to be missing from assertNestedTensorObjectsEqual.

Contributor:

Okay, but we shouldn't have both in this file; we should only have one.
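For context, the core of both helpers under discussion is a recursive comparison over nested containers with a tolerance at the leaves. A minimal torch-free sketch (the name `assert_nested_close` is illustrative, not either helper's real implementation):

```python
import math

def assert_nested_close(a, b, rtol=1e-5, atol=1e-8):
    """Recursively compare nested lists/tuples/dicts of numbers.
    Unlike PyTorch's assertEqual, this does no coercion between
    scalars and tensor-like results: the structure must match exactly."""
    assert type(a) == type(b), (type(a), type(b))
    if isinstance(a, dict):
        assert a.keys() == b.keys()
        for k in a:
            assert_nested_close(a[k], b[k], rtol, atol)
    elif isinstance(a, (list, tuple)):
        assert len(a) == len(b)
        for x, y in zip(a, b):
            assert_nested_close(x, y, rtol, atol)
    else:
        assert math.isclose(a, b, rel_tol=rtol, abs_tol=atol), (a, b)

# Matching nested structures within tolerance pass silently.
assert_nested_close({'logits': [1.0, 2.0]}, {'logits': [1.0, 2.0000001]})
```

The strict type check at the top is what makes this narrower than assertEqual, which is the reviewer's point: model-equality tests don't want scalar/tensor coercion.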

        else:
            self.assertEqual(a, b)

    def checkModule(self, nn_module, args, unwrapper=None, skip=False):
Contributor Author:

This is copied from pytorch/test/jit_utils.py

@eellison (Contributor) left a comment:

looks good


# as possible
SCRIPT_MODELS_TO_FIX = [
    'test_inception_v3',
    'test_fcn_resnet101',
Member:

Does this mean that the models don't give the same results in eager and torchscript?

Contributor Author (@driazati), Oct 8, 2019:

There is a bug in test_deeplabv3_resnet101: it compiles, but executing it hits pytorch/pytorch#27549. The others that were here previously were just unimplemented, but they're good to go now.

SCRIPT_MODELS_TO_FIX = [
    'test_inception_v3',
    'test_fcn_resnet101',
    'test_deeplabv3_resnet101',
Contributor:

I went to #1352 and the model results stayed the same before and after. Also, IntermediateLayerGetter is used in other models that return the same eager and script results, so it's unclear why these two models are failing.

When I ran one in script mode I had to Ctrl-C it because it was taking so long. I would guess it's a bug somewhere in the JIT runtime.

Contributor Author:

These tests do add a lot of time to the test execution time (@suo's module changes should help with this), so this PR uses PYTORCH_TEST_WITH_SLOW and @slowTest (on by default in the CI) to get around that.

@codecov-io commented Oct 8, 2019

Codecov Report

Merging #1430 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master    #1430   +/-   ##
=======================================
  Coverage   65.97%   65.97%           
=======================================
  Files          91       91           
  Lines        7229     7229           
  Branches     1095     1095           
=======================================
  Hits         4769     4769           
  Misses       2153     2153           
  Partials      307      307
Impacted Files Coverage Δ
torchvision/models/resnet.py 88.62% <100%> (+0.06%) ⬆️
torchvision/models/mobilenet.py 87.17% <100%> (+0.16%) ⬆️
torchvision/models/detection/rpn.py 81.05% <100%> (-0.25%) ⬇️
torchvision/models/quantization/resnet.py 93.25% <100%> (ø) ⬆️
torchvision/models/quantization/shufflenetv2.py 90.16% <100%> (ø) ⬆️
torchvision/models/quantization/mobilenet.py 87.75% <100%> (ø) ⬆️
torchvision/models/shufflenetv2.py 86.04% <100%> (+0.16%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2e8bcf8...6e07772.

@fmassa (Member) left a comment:

Thanks for the PR!

The change in config.yml can't get in without a corresponding change in config.yml.in.

What are the tests that are taking too long now? Can we maybe enable them in a separate PR? We might need a separate CI job for this.

export DOCKER_IMAGE=soumith/conda-cuda
export VARS_TO_PASS="-e PYTHON_VERSION -e BUILD_VERSION -e PYTORCH_VERSION -e UNICODE_ABI -e CU_VERSION"
export PYTORCH_TEST_WITH_SLOW=1
Member:

Those changes should be done in config.yml.in; then we should call python regenerate.py, which will create the config.yml for you.

Also, there is a single CI test that runs with this configuration (gpu.medium), which doesn't get tested on Python2 for example.

I think that it might be better to separate this change from this PR, because it might require changing how we handle CI builds.

Contributor Author:

Most of the tests have a pretty significant bump in runtime

Top 10 with PYTORCH_TEST_WITH_SLOW=0:

63.04s call     test/test_models.py::ModelTester::test_memory_efficient_densenet   
60.13s call     test/test_models.py::ModelTester::test_fasterrcnn_double           
48.19s call     test/test_models.py::ModelTester::test_keypointrcnn_resnet50_fpn   
41.29s call     test/test_models.py::ModelTester::test_maskrcnn_resnet50_fpn       
36.69s call     test/test_models.py::ModelTester::test_fasterrcnn_resnet50_fpn
24.86s call     test/test_models.py::ModelTester::test_vgg19_bn
24.33s call     test/test_models.py::ModelTester::test_vgg16_bn
23.98s call     test/test_models.py::ModelTester::test_vgg13
23.49s call     test/test_models.py::ModelTester::test_vgg11_bn
23.13s call     test/test_models.py::ModelTester::test_deeplabv3_resnet101

Top 10 after PYTORCH_TEST_WITH_SLOW=1:

575.13s call     test/test_models.py::ModelTester::test_densenet121                
276.74s call     test/test_models.py::ModelTester::test_fcn_resnet101              
218.32s call     test/test_models.py::ModelTester::test_inception_v3               
182.89s call     test/test_models.py::ModelTester::test_shufflenet_v2_x1_0         
179.08s call     test/test_models.py::ModelTester::test_mobilenet_v2
171.40s call     test/test_models.py::ModelTester::test_deeplabv3_resnet101
127.75s call     test/test_models.py::ModelTester::test_googlenet
93.93s call     test/test_models.py::ModelTester::test_resnext50_32x4d
70.70s call     test/test_models.py::ModelTester::test_memory_efficient_densenet

PyTorch does have a CI job specifically for slow tests, maybe something like that would be better here too. I can separate the CI changes to a different PR, leaving this one to rely on manually setting PYTORCH_TEST_WITH_SLOW=1 to run the TorchScript tests.

Member:

Yeah, I agree. Let me add a specific CI test for this in a single configuration for now.

Member:

@driazati tests are failing for Faster R-CNN with precision issues. Can you have a look? (FYI, I force-pushed to your branch to rebase.) I'm looking into configuring CI to run the slow tests.

Your Name added 8 commits October 18, 2019 15:19
This copies some logic from `test_jit.py` to check that a TorchScript'ed
model's outputs are the same as outputs from the model in eager mode.

To support differences in TorchScript / eager mode outputs, an
`unwrapper` function can be provided per-model.
@fmassa force-pushed the driazati/checkmodule branch from 6df1d48 to b3471f9 on October 18, 2019 13:19
@driazati (Contributor Author) left a comment:

There was some jitter in the test results for a couple of models, but since there was no change to the model code, there are some new --accept results.


ACCEPT = os.getenv('EXPECTTEST_ACCEPT')
TEST_WITH_SLOW = os.getenv('PYTORCH_TEST_WITH_SLOW', '0') == '1'
TEST_WITH_SLOW = True # TODO: Delete this line once there is a PYTORCH_TEST_WITH_SLOW aware CI job
Contributor Author:

This should be deleted once the aforementioned CI job is up and running
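The gating pattern discussed here amounts to a decorator that skips a test unless the environment variable is set. A hedged, stand-alone sketch (the `slowTest` name mirrors PyTorch's decorator, but this reimplementation is illustrative, not PyTorch's actual code):

```python
import os
import unittest

# Mirrors the snippet above: the flag is read once at import time.
TEST_WITH_SLOW = os.getenv('PYTORCH_TEST_WITH_SLOW', '0') == '1'

def slowTest(fn):
    """Skip the decorated test unless PYTORCH_TEST_WITH_SLOW=1 is set."""
    return unittest.skipUnless(TEST_WITH_SLOW, 'skipping slow test')(fn)

class ModelTester(unittest.TestCase):
    @slowTest
    def test_scripted_model(self):
        # Placeholder body; the real tests run checkModule on each model.
        self.assertEqual(2 + 2, 4)
```

With the flag unset, the test is reported as skipped rather than failed, which is why the `TEST_WITH_SLOW = True` override above had to be hardcoded until a flag-aware CI job existed.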

@driazati driazati requested a review from fmassa October 31, 2019 01:37
@eellison (Contributor) left a comment:

This seems good to me to land; we probably need to rebase one more time to make sure it works on master. What do you think, @fmassa?

@fmassa (Member) commented Oct 31, 2019

There seems to be a problem with latest PyTorch binaries

AttributeError: 'RecursiveScriptModule' object has no attribute 'forward'

CI on master seems to be working OK, #1542

I'm ok merging this as soon as CI is green (that's, by the way, why I didn't merge it last week when I first rebased it).

@fmassa fmassa merged commit 227027d into master Nov 30, 2019
@fmassa fmassa deleted the driazati/checkmodule branch November 30, 2019 15:23
fmassa pushed a commit to fmassa/vision-1 that referenced this pull request Jun 9, 2020
* Add tests for results in script vs eager mode

This copies some logic from `test_jit.py` to check that a TorchScript'ed
model's outputs are the same as outputs from the model in eager mode.

To support differences in TorchScript / eager mode outputs, an
`unwrapper` function can be provided per-model.

* Fix inception, use PYTORCH_TEST_WITH_SLOW

* Update

* Remove assertNestedTensorObjectsEqual

* Add PYTORCH_TEST_WITH_SLOW to CircleCI config

* Add MaskRCNN unwrapper

* fix prec args

* Remove CI changes

* update

* Update

* remove expect changes

* Fix tolerance bug

* Fix breakages

* Fix quantized resnet

* Fix merge errors and simplify code

* DeepLabV3 has been fixed

* Temporarily disable jit compilation