
Conversation

fbbradheintz
Contributor

@fbbradheintz fbbradheintz commented Oct 15, 2019

This PR provides correctness tests and scriptability checks for all classification models in TorchVision, including AlexNet, Inception, SqueezeNet, MobileNet, GoogLeNet, VGG, MNASNet, DenseNet, ShuffleNet, and ResNet (as well as variations of some of these, such as memory-efficient DenseNet).

@fbbradheintz fbbradheintz requested a review from fmassa October 15, 2019 17:17
@fbbradheintz fbbradheintz self-assigned this Oct 15, 2019
Member

@fmassa fmassa left a comment

LGTM, thanks!

Let's wait until we fix CI before merging this PR.

BTW, I think you might have lint failures in your PR; can you run a pep8 check?

Contributor

@eellison eellison left a comment

Do you mind updating the summary of the PR with your changes?

I think it is a good idea to add test fixtures to the models, so in that sense this is an improvement. I'm not sure that manually adding all models as TestCases instead of generating them automatically is an improvement, though. I find this code harder to read and harder to validate against what is being tested. It is true that the set of available models is more discoverable now, but I'm not sure for whom that was a problem.

There are benefits and drawbacks to each approach, though, and this is just one input.



def set_rng_seed(seed):
EPSILON = 1e-5 # small value for approximate comparisons/assertions
Contributor

This is only used once, maybe delete

Contributor Author

It started off being more broadly used, but I refactored most of that away. I still favor keeping a named constant here - either EPSILON or TOLERANCE - rather than just inserting a magic number with no explanation.

class ClassificationCoverageTester(TestCase):

# Find all models exposed by torchvision.models factory methods (with assumptions)
def get_available_classification_models(self):
Contributor

Why are we only testing classification models for this ?

Contributor Author

The other model types are still using the old, implicit method of test generation.

Contributor

I don't think there should be two separate pathways at one time.

Contributor Author

@fbbradheintz fbbradheintz Oct 16, 2019

I agree, but my brief was to cover the classification models. Also, I think the three other major classes of model should probably get broken out into separate files.

me_output = self._infer_for_test_with(me_model, test_input)
test_output.squeeze(0)
me_output.squeeze(0)
self.assertTrue((test_output - me_output).abs().max() < EPSILON)
Contributor

Could we just use assertNestedTensorObjectsEqual, or the assertEqual from @driazati's PR (#1430)? Both of those methods already have a tolerance argument built in.

Also, since EPSILON is only used once, I think it would be better to inline it here.

Contributor Author

I'm happy to switch over to using a more idiomatic assertion call. I'll look into the ones you flagged.

I'm less jazzed about taking away the label, even if it's only used once. Numbers have no semantic content. Not a huge deal in this case, but I prefer a name like EPSILON or TOLERANCE to a bare constant every time.

Contributor

The other usages aren't magic numbers; the tolerances are passed as keyword arguments and have clear semantic meaning.

self.assertExpected(test_value, rtol=.01, atol=.01)

Contributor Author

Cool - named arguments serve the semantic purpose. I'll tweak this.
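For context, here is a minimal sketch of how the comparison from the snippet above could look with explicit keyword tolerances. This is not the PR's final code: torch.allclose is standard PyTorch, the names me_model, test_input, and _infer_for_test_with come from the diff, and test_output is assumed to hold the reference model's output from earlier in the test.

    import torch

    me_output = self._infer_for_test_with(me_model, test_input)
    # squeeze() is not in-place, so keep the returned tensors rather than discarding them
    test_output = test_output.squeeze(0)
    me_output = me_output.squeeze(0)
    # keyword tolerances carry the semantic meaning a bare constant lacks
    self.assertTrue(torch.allclose(test_output, me_output, rtol=1e-5, atol=1e-5))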

# caching because these tests take pretty long already (instantiating models and all)
TEST_INPUTS = {}

def _get_test_input(self, shape=STANDARD_INPUT_SHAPE):
Contributor

@eellison eellison Oct 15, 2019

Did you test that this had any noticeable perf improvement? We have also run into memory issues in the past...

Contributor Author

The speed-up is not huge - about a second on a test suite that takes about 200s. It's also not a huge memory hit, since currently only two modestly-sized input tensors are being created. If it becomes an issue, I'm happy to make needed changes.
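For reference, a minimal sketch of the shape-keyed caching under discussion, assuming STANDARD_INPUT_SHAPE is a tuple defined elsewhere in the file and that torch has been imported; the PR's actual method body may differ.

    # caching because these tests take pretty long already (instantiating models and all)
    TEST_INPUTS = {}

    def _get_test_input(self, shape=STANDARD_INPUT_SHAPE):
        # one random tensor per input shape, shared across tests (shape must be hashable)
        if shape not in self.TEST_INPUTS:
            self.TEST_INPUTS[shape] = torch.rand(shape)
        return self.TEST_INPUTS[shape]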

Contributor

@driazati driazati left a comment

Not enumerating the models being tested did make the tests unclear, but this seems very verbose and easy to get out of sync with torchvision.models. Given that the test report already includes the model name, could we just add a better comment explaining the scheme that turns a model into a test? That would cut down on a lot of the changes needed in this PR.

# dilation) and says nothing about the correctness of the test, nor
# of the model. It just enforces a naming scheme on the tests, and
# verifies that all models have a corresponding test name.
def test_classification_model_coverage(self):
Contributor

Why aren't the torchvision.models factory methods tested directly?

Contributor Author

The generated tests were implicit and unintentional, and most of the tests were identical, except for a few that were added/tweaked in decision branches. Besides the fact that I was enhancing the test with correctness checks for each model, these generated tests exhibit at least two anti-patterns:

  1. Behavior changes that depend on type should be handled by subclassing, not by conditionals.
  2. This creates a state where I can add a new model, and a (potentially) passing test gets generated without my ever looking at it. If the passing test is correct, great, I lucked out - but if it's not I have no warning. This is why tests should be explicit and intentional.

In the new code:

  1. Test methods are grouped by the models they test. Making a change to ResNet? All the tests for all variants are in one obviously-named class, and you don't have to tease them out of an introspection loop. Differences like the special input shape needed by Inception are handled in the subclass devoted to Inception, not by polluting a conditional branch of omnibus code that tries to cover every variant of every class.
  2. If you make a new model, and give it a factory method in torchvision.models, test_classification_model_coverage() fails the next time you run tests; thus, the dev is nudged to intentionally create a new test that covers the new model/factory method.

So we get more maintainable tests, lower overhead when changing tests for new or updated models, and an automatic canary test for new factory methods in torchvision.models.

Looking further ahead, our model offerings are likely to continue to grow, and (I hope) we'll continue building out the test suite, both to cover new models and to give better coverage of existing ones. Keeping an increasingly complex test suite for a growing body of code under test in a single class governed by decision branches will turn ugly. The new scheme should be more extensible and have a lower cost of change.
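To make the mechanism concrete, here is a rough sketch of the coverage check described above; it is not the exact code in the PR. The lowercase-callable filter and the module-wide scan of TestCase subclasses are illustrative assumptions.

    import torchvision.models

    def get_available_classification_models(self):
        # Assumption: classification factory methods are the lowercase public callables
        # in torchvision.models; classes (ResNet, AlexNet, ...) and submodules are skipped.
        return [name for name, obj in vars(torchvision.models).items()
                if callable(obj) and not name.startswith('_') and name.islower()]

    def test_classification_model_coverage(self):
        # Gather test method names from every TestCase subclass in this module,
        # since the per-model tests live in their own classes.
        test_names = set()
        for obj in globals().values():
            if isinstance(obj, type) and issubclass(obj, TestCase):
                test_names.update(n for n in dir(obj) if n.startswith('test_'))
        for model_name in self.get_available_classification_models():
            self.assertIn('test_' + model_name, test_names,
                          'no test found for torchvision.models.' + model_name)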

Contributor

@eellison eellison Oct 15, 2019

Generally I think it's best to avoid huge refactorings before they're clearly needed, and in this case I'm not sure the need has been demonstrated. While the situation you describe of increasingly complex conditionals seems plausible, it isn't the case today, and I think the updated version is a substantial increase in the complexity of the code.
This adds:
- caching of model inputs per dimensionality
- introspection of test coverage
- two separate pathways for testing scriptability
- two separate pathways for adding model tests
- roughly four times as much code in this test file

I don't think it's likely that someone adds a new model and there's no testing for it. If someone adds a new model, like memory-efficient DenseNet, they're expected to add a new test for it. I don't think this is a difficult standard to uphold as a contributor or reviewer.

Contributor Author

This refactoring was needed in order to work with the tests in isolation. Those implicit, unintentional tests were a pain to isolate and change, which limits their utility. Also, the correctness checks themselves are an example of the building complexity: they are sometimes slightly different for different models, e.g. checking that the memory-efficient DenseNet implementation produces the same output as the vanilla version. To your other points:

  • I'm not married to the test input caching, and if there's a demonstrable issue that I'm not aware of (as opposed to merely a prospective one), I'm happy to change it.
  • The introspection was already happening. I only included it so we wouldn't lose the sense of security people seem to get from the implicit/unintentional tests being generated over the factory methods.
  • If by "two separate pathways" you mean the difference between classification and other model types, that's true. My brief only covered Classification models. I think the other model types should be broken out into their own files (as their testing needs will likely differ), and updated to dump the opaque generated tests.
  • Yes, there is more code, but it's also modularized in a way that explicitly maps the tests to the code under test, removes conditionals in favor of OO, and makes it trivially obvious to find where to make a change or set a breakpoint for a particular model with a text search. "Less code" is not, in every situation, the best variable to optimize for.

Your last point, about tests being enforced by process, is orthogonal to the point I made above. If someone adds a model, never looks at the (implicit, unintentional) tests, and the automated tests all show green, it's unlikely that either the dev or a reviewer will dig deeper. The generated tests leave that door open. The new tests do nothing if the dev writes their tests first, and give a nudge if they don't.

Contributor

@driazati driazati left a comment

This PR contains two major changes, a big refactor plus new correctness tests, and these should probably be split into two PRs. The correctness tests could be added with a much simpler framework around them, and separating out the refactoring changes would make the PR easier to review.

@fbbradheintz
Contributor Author

fbbradheintz commented Oct 17, 2019

Per conversation with @fmassa, I'm merging this.

@fbbradheintz fbbradheintz merged commit 1e857d9 into master Oct 17, 2019
@fbbradheintz fbbradheintz deleted the issues/1234 branch October 17, 2019 17:46
fmassa added a commit that referenced this pull request Oct 18, 2019
fmassa added a commit that referenced this pull request Oct 18, 2019
@fbbradheintz fbbradheintz restored the issues/1234 branch October 18, 2019 18:53