Added correctness tests for classification models #1468
Conversation
LGTM, thanks!
Let's wait until we fix CI before merging this PR.
BTW, I think you might have lint failures in your PR, can you try applying pep8 to check?
Do you mind updating the summary of the PR with your changes?
I think it is a good idea to add test fixtures to the models, so in that sense this is an improvement. I'm not sure that manually adding all models as a TestCase instead of automatically is an improvement. I find it harder to read this code and validate what is being tested. It is true that what models exist is more discoverable now, but I'm not sure for whom this was a problem.
There are benefits and drawbacks to each approach, though, and this is just one input.
test/test_models.py (Outdated)
def set_rng_seed(seed):
EPSILON = 1e-5  # small value for approximate comparisons/assertions
This is only used once, maybe delete
It started off being more broadly used, but I refactored most of that away. I still favor keeping a named constant here - either EPSILON or TOLERANCE - rather than just inserting a magic number with no explanation.
class ClassificationCoverageTester(TestCase):

    # Find all models exposed by torchvision.models factory methods (with assumptions)
    def get_available_classification_models(self):
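For context, a minimal module-level sketch of what such a discovery helper might look like; the lowercase-name filtering rule is an assumption for illustration, not necessarily the exact rule the PR uses:

import torchvision.models as models

def get_available_classification_models():
    # Assumption: every public, lowercase-named callable exported by
    # torchvision.models is a classification factory function
    # (e.g. resnet18, densenet121). Submodules such as models.detection
    # are skipped because they are modules, not callables.
    return [
        name for name, obj in models.__dict__.items()
        if callable(obj) and not name.startswith("_") and name.islower()
    ]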
Why are we only testing classification models for this?
The other model types are still using the old, implicit method of test generation.
I don't think there should be two separate pathways at one time.
I agree, but my brief was to cover the classification models. Also, I think the three other major classes of model should probably get broken out into separate files.
test/test_models.py (Outdated)
me_output = self._infer_for_test_with(me_model, test_input)
test_output.squeeze(0)
me_output.squeeze(0)
self.assertTrue((test_output - me_output).abs().max() < EPSILON)
I'm happy to switch over to using a more idiomatic assertion call. I'll look into the ones you flagged.
I'm less jazzed about taking away the label, even if it's only used once. Numbers have no semantic content. Not a huge deal in this case, but I prefer a name like EPSILON or TOLERANCE to a bare constant every time.
The other usages aren't magic numbers; the tolerances are used with keyword arguments and have clear semantic meaning.
self.assertExpected(test_value, rtol=.01, atol=.01)
Cool - named arguments serve the semantic purpose. I'll tweak this.
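For illustration, a hedged sketch of what the reworked comparison might become; torch.allclose is one idiomatic option, and the helper name and default tolerances are illustrative rather than the PR's actual code:

import torch

def assert_outputs_close(test_case, test_output, me_output, rtol=1e-2, atol=1e-2):
    # Elementwise comparison within relative/absolute tolerances; the
    # keyword arguments carry the meaning that a bare EPSILON constant
    # would otherwise have to supply.
    test_case.assertTrue(
        torch.allclose(test_output, me_output, rtol=rtol, atol=atol),
        msg="outputs differ beyond rtol={}, atol={}".format(rtol, atol),
    )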
    # caching because these tests take pretty long already (instantiating models and all)
    TEST_INPUTS = {}

    def _get_test_input(self, shape=STANDARD_INPUT_SHAPE):
Did you test that this had any noticeable perf improvement? We have also run into memory issues in the past...
The speed-up is not huge - about a second on a test suite that takes about 200s. It's also not a huge memory hit, since currently only two modestly-sized input tensors are being created. If it becomes an issue, I'm happy to make needed changes.
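For reference, a minimal module-level sketch of the caching scheme being discussed; the shape constant and the seeding are assumptions, and the PR's method-based version may differ:

import torch

STANDARD_INPUT_SHAPE = (1, 3, 224, 224)  # assumed default; most classifiers take 224x224 inputs

# One cached tensor per input shape, so repeated tests reuse it rather
# than allocating a fresh tensor each time.
TEST_INPUTS = {}

def _get_test_input(shape=STANDARD_INPUT_SHAPE):
    if shape not in TEST_INPUTS:
        torch.manual_seed(0)  # keep cached inputs reproducible across runs
        TEST_INPUTS[shape] = torch.rand(shape)
    return TEST_INPUTS[shape]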
Not enumerating the models being tested did make the tests unclear, but this seems very verbose and easy to get out of sync with torchvision.models. Given that testing reports the name, could we just better comment on the scheme that turns a model into a test? That would cut down on a lot of the changes needed in this PR.
    # dilation) and says nothing about the correctness of the test, nor
    # of the model. It just enforces a naming scheme on the tests, and
    # verifies that all models have a corresponding test name.
    def test_classification_model_coverage(self):
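A hedged sketch of the kind of naming-scheme check the comment describes; the discovery rule and the way existing test names are collected are assumptions, not the PR's exact implementation:

import unittest
import torchvision.models as models

def _classification_factory_names():
    # Same discovery assumption as the earlier sketch: public, lowercase
    # callables in torchvision.models are treated as classification factories.
    return [
        name for name, obj in models.__dict__.items()
        if callable(obj) and not name.startswith("_") and name.islower()
    ]

class ClassificationCoverageTester(unittest.TestCase):

    def test_classification_model_coverage(self):
        # Enforce only the naming scheme: every factory needs a matching
        # test_<factory_name> method on some TestCase in this module.
        # It says nothing about whether that test is correct.
        test_names = {
            attr
            for obj in globals().values()
            if isinstance(obj, type) and issubclass(obj, unittest.TestCase)
            for attr in dir(obj)
            if attr.startswith("test_")
        }
        missing = [
            name for name in _classification_factory_names()
            if "test_" + name not in test_names
        ]
        self.assertEqual(missing, [], "models with no matching test: {}".format(missing))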
Why aren't the torchvision.models factory methods tested directly?
The generated tests were implicit and unintentional, and most of the tests were identical, except for a few that were added/tweaked in decision branches. Besides the fact that I was enhancing the test with correctness checks for each model, these generated tests exhibit at least two anti-patterns:
- Behavior changes that depend on type should be handled by subclassing, not by conditionals.
- This creates a state where I can add a new model, and a (potentially) passing test gets generated without my ever looking at it. If the passing test is correct, great, I lucked out - but if it's not, I have no warning. This is why tests should be explicit and intentional.
In the new code:
- Test methods are grouped by the models they test. Making a change to ResNet? All the tests for all variants are in one obviously-named class, and you don't have to tease them out of an introspection loop. Differences like the special input shape needed by Inception are handled in the subclass devoted to Inception, not by polluting a conditional branch of omnibus code that tries to cover every variant of every class.
- If you make a new model and give it a factory method in torchvision.models, test_classification_model_coverage() fails the next time you run tests; thus, the dev is nudged to intentionally create a new test that covers the new model/factory method.
So we get more maintainable tests, lower overhead when changing tests for new/updated models, and an automatic canary test for new factory methods in torchvision.models.
Looking ever further forward, our model offerings are likely to continue to grow, and (I hope) we'll continue building out the test suite, both to cover new models and give better coverage on existing ones. Keeping an increasingly complex test suite for a growing body of code under test in a single class governed by decision branches will turn ugly. The new scheme should have more extensibility and a lower cost of change.
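To make the subclass-per-family structure concrete, a minimal sketch under assumed names (the tester classes, the helper method, and the num_classes value are illustrative, not the PR's actual code):

import unittest
import torch
import torchvision.models as models

class ClassificationModelTester(unittest.TestCase):
    # Shared defaults; subclasses override only what differs for their family.
    INPUT_SHAPE = (1, 3, 224, 224)
    NUM_CLASSES = 50

    def _check_classification_output(self, model):
        model.eval()
        with torch.no_grad():
            out = model(torch.rand(self.INPUT_SHAPE))
        self.assertEqual(out.shape, (self.INPUT_SHAPE[0], self.NUM_CLASSES))

class ResnetTester(ClassificationModelTester):
    def test_resnet18(self):
        self._check_classification_output(models.resnet18(num_classes=self.NUM_CLASSES))

class InceptionTester(ClassificationModelTester):
    # Inception needs larger inputs; the override lives here rather than in
    # a conditional branch of a shared omnibus test.
    INPUT_SHAPE = (1, 3, 299, 299)

    def test_inception_v3(self):
        self._check_classification_output(
            models.inception_v3(num_classes=self.NUM_CLASSES, aux_logits=False)
        )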
Generally I think it's best to avoid huge refactorings before they're clearly needed. In this case I'm not sure the need has been demonstrated. While the situation you're describing of increasingly complex conditionals seems plausible, it isn't the case today, and I think the updated version here is a substantial increase to the complexity of the code.
This adds:
- caching of model inputs per dimensionality
- introspection of test coverage
- two separate pathways for testing scriptability
- two separate pathways for adding model tests
- roughly four times as much code in this test file
I don't think it's likely that someone adds a new model and it ends up with no testing. If someone adds a new model, like memory-efficient DenseNet, they're expected to add a new test for it. I don't think this is a difficult standard to uphold as a contributor or reviewer.
This refactoring was needed in order to work with the tests in isolation. Those implicit, unintentional tests were a PITA to isolate and change, which limits their utility. Also, the correctness checks themselves are an example of the building complexity - they are sometimes slightly different for different models, e.g. checking that the memory-efficient DenseNet implementation shows the same output as the vanilla version. To your other points:
- I'm not married to the test input caching, and if there's a demonstrable issue that I'm not aware of (as opposed to merely a prospective one), I'm happy to change it.
- The introspection was already happening. I only included it so we wouldn't lose the sense of security people seem to get from the implicit/unintentional tests being generated over the factory methods.
- If by "two separate pathways" you mean the difference between classification and other model types, that's true. My brief only covered Classification models. I think the other model types should be broken out into their own files (as their testing needs will likely differ), and updated to dump the opaque generated tests.
- Yes, there is more code, but it's also modularized in a way that explicitly maps the tests to the code under test, removes conditionals in favor of OO, and makes it trivially obvious to find where to make a change or set a breakpoint for a particular model with a text search. "Less code" is not, in every situation, the best variable to optimize for.
Your last point - about tests being enforced by process - is orthogonal to the point I made above. If someone adds a model, never looks at the (implicit, unintentional) tests, and the automated tests all show green, the likelihood is against either the dev or a reviewer going deeper. The generated tests leave that door open. The new tests do nothing if the dev codes their tests first, and give a nudge if they don't.
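As an illustration of that kind of model-specific check, here is a hedged sketch comparing the memory-efficient DenseNet variant against the default one; the seeding trick and the tolerances are assumptions, and the PR may instead share weights via a state_dict copy:

import torch
import torchvision.models as models

def check_memory_efficient_densenet(num_classes=50, rtol=1e-2, atol=1e-2):
    # Seed before each construction so both variants start from identical
    # random weights (the memory_efficient flag only changes the forward
    # pass, not the parameter layout).
    torch.manual_seed(0)
    vanilla = models.densenet121(num_classes=num_classes).eval()
    torch.manual_seed(0)
    me_model = models.densenet121(num_classes=num_classes, memory_efficient=True).eval()

    x = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        assert torch.allclose(vanilla(x), me_model(x), rtol=rtol, atol=atol)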
This PR contains 2 major changes: a big refactor + correctness tests; these should probably be split into 2 PRs. The correctness tests could be added with a much simpler framework around them to make them run, and separating out the refactoring changes would make it easier to review.
Per conversation with @fmassa, I'm merging this.
This PR provides correctness tests and scriptability checks for all classification models in TorchVision, including Alexnet, Inception, Squeezenet, MobileNet, Googlenet, VGG, MNASNet, Densenet, Shufflenet, and Resnet (as well as variations of some of these, such as memory-efficient Densenet).
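For reference, the scriptability side of such checks typically reduces to something like the following sketch; the helper name, input shape, and tolerances are illustrative assumptions:

import torch
import torchvision.models as models

def check_scriptable(model_fn, input_shape=(1, 3, 224, 224), num_classes=50):
    # Script the model with TorchScript and confirm the scripted module
    # agrees with eager execution on the same random input.
    model = model_fn(num_classes=num_classes).eval()
    scripted = torch.jit.script(model)
    x = torch.rand(input_shape)
    with torch.no_grad():
        assert torch.allclose(model(x), scripted(x), rtol=1e-5, atol=1e-5)

check_scriptable(models.resnet18)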