Added correctness tests for classification models #1468
Conversation
LGTM, thanks!
Let's wait until we fix CI before merging this PR.
BTW, I think you might have lint failures in your PR, can you try applying pep8 to check?
Do you mind updating the summary of the PR with your changes?
I think it is a good idea to add test fixtures to the models, so in that sense this is an improvement. I'm not sure that manually adding all models as a TestCase instead of automatically is an improvement. I find it harder to read this code and validate what is being tested. It is true that what models exist is more discoverable now, but I'm not sure for whom this was a problem.
There are benefits and drawbacks to each approach, though, and this is just one input.
test/test_models.py (Outdated)
def set_rng_seed(seed):
EPSILON = 1e-5  # small value for approximate comparisons/assertions
This is only used once, maybe delete
It started off being more broadly used, but I refactored most of that away. I still favor keeping a named constant here - either EPSILON or TOLERANCE - rather than just inserting a magic number with no explanation.
class ClassificationCoverageTester(TestCase):

    # Find all models exposed by torchvision.models factory methods (with assumptions)
    def get_available_classification_models(self):
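For context, a minimal module-level sketch of what such a discovery helper might look like; the lowercase-name filtering rule is an assumption for illustration, not necessarily the exact rule the PR uses:

import torchvision.models as models

def get_available_classification_models():
    # Assumption: every public, lowercase-named callable exported by
    # torchvision.models is a classification factory function
    # (e.g. resnet18, densenet121). Submodules such as models.detection
    # are skipped because they are modules, not callables.
    return [
        name for name, obj in models.__dict__.items()
        if callable(obj) and not name.startswith("_") and name.islower()
    ]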
Why are we only testing classification models for this?
The other model types are still using the old, implicit method of test generation.
I don't think there should be two separate pathways at one time.
I agree, but my brief was to cover the classification models. Also, I think the three other major classes of model should probably get broken out into separate files.
test/test_models.py (Outdated)
me_output = self._infer_for_test_with(me_model, test_input)
test_output.squeeze(0)
me_output.squeeze(0)
self.assertTrue((test_output - me_output).abs().max() < EPSILON)
I'm happy to switch over to using a more idiomatic assertion call. I'll look into the ones you flagged.
I'm less jazzed about taking away the label, even if it's only used once. Numbers have no semantic content. Not a huge deal in this case, but I prefer a name like EPSILON or TOLERANCE to a bare constant every time.
The other usages aren't magic numbers; the tolerances are used with keyword arguments and have clear semantic meaning.
self.assertExpected(test_value, rtol=.01, atol=.01)
Cool - named arguments serve the semantic purpose. I'll tweak this.
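For illustration, a hedged sketch of what the reworked comparison might become; torch.allclose is one idiomatic option, and the helper name and default tolerances are illustrative rather than the PR's actual code:

import torch

def assert_outputs_close(test_case, test_output, me_output, rtol=1e-2, atol=1e-2):
    # Elementwise comparison within relative/absolute tolerances; the
    # keyword arguments carry the meaning that a bare EPSILON constant
    # would otherwise have to supply.
    test_case.assertTrue(
        torch.allclose(test_output, me_output, rtol=rtol, atol=atol),
        msg="outputs differ beyond rtol={}, atol={}".format(rtol, atol),
    )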
    # caching because these tests take pretty long already (instantiating models and all)
    TEST_INPUTS = {}

    def _get_test_input(self, shape=STANDARD_INPUT_SHAPE):
Did you test that this had any noticeable perf improvement? We have also run into memory issues in the past...
The speed-up is not huge - about a second on a test suite that takes about 200s. It's also not a huge memory hit, since currently only two modestly-sized input tensors are being created. If it becomes an issue, I'm happy to make needed changes.
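For reference, a minimal module-level sketch of the caching scheme being discussed; the shape constant and the seeding are assumptions, and the PR's method-based version may differ:

import torch

STANDARD_INPUT_SHAPE = (1, 3, 224, 224)  # assumed default; most classifiers take 224x224 inputs

# One cached tensor per input shape, so repeated tests reuse it rather
# than allocating a fresh tensor each time.
TEST_INPUTS = {}

def _get_test_input(shape=STANDARD_INPUT_SHAPE):
    if shape not in TEST_INPUTS:
        torch.manual_seed(0)  # keep cached inputs reproducible across runs
        TEST_INPUTS[shape] = torch.rand(shape)
    return TEST_INPUTS[shape]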
Not enumerating the models being tested did make the tests unclear, but this seems very verbose and easy to get out of sync with torchvision.models. Given that testing reports the name, could we just better comment on the scheme that turns a model into a test? That would cut down on a lot of the changes needed in this PR.
    # dilation) and says nothing about the correctness of the test, nor
    # of the model. It just enforces a naming scheme on the tests, and
    # verifies that all models have a corresponding test name.
    def test_classification_model_coverage(self):
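A hedged sketch of the kind of naming-scheme check the comment describes; the discovery rule and the way existing test names are collected are assumptions, not the PR's exact implementation:

import unittest
import torchvision.models as models

def _classification_factory_names():
    # Same discovery assumption as the earlier sketch: public, lowercase
    # callables in torchvision.models are treated as classification factories.
    return [
        name for name, obj in models.__dict__.items()
        if callable(obj) and not name.startswith("_") and name.islower()
    ]

class ClassificationCoverageTester(unittest.TestCase):

    def test_classification_model_coverage(self):
        # Enforce only the naming scheme: every factory needs a matching
        # test_<factory_name> method on some TestCase in this module.
        # It says nothing about whether that test is correct.
        test_names = {
            attr
            for obj in globals().values()
            if isinstance(obj, type) and issubclass(obj, unittest.TestCase)
            for attr in dir(obj)
            if attr.startswith("test_")
        }
        missing = [
            name for name in _classification_factory_names()
            if "test_" + name not in test_names
        ]
        self.assertEqual(missing, [], "models with no matching test: {}".format(missing))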
Why aren't the torchvision.models factory methods tested directly?
The generated tests were implicit and unintentional, and most of the tests were identical, except for a few that were added/tweaked in decision branches. Besides the fact that I was enhancing the test with correctness checks for each model, these generated tests exhibit at least two anti-patterns:
- Behavior changes that depend on type should be handled by subclassing, not by conditionals.
- This creates a state where I can add a new model, and a (potentially) passing test gets generated without my ever looking at it. If the passing test is correct, great, I lucked out - but if it's not, I have no warning. This is why tests should be explicit and intentional.
In the new code:
- Test methods are grouped by the models they test. Making a change to ResNet? All the tests for all variants are in one obviously-named class, and you don't have to tease them out of an introspection loop. Differences like the special input shape needed by Inception are handled in the subclass devoted to Inception, not by polluting a conditional branch of omnibus code that tries to cover every variant of every class.
- If you make a new model and give it a factory method in torchvision.models, test_classification_model_coverage() fails the next time you run tests; thus, the dev is nudged to intentionally create a new test that covers the new model/factory method.
So we get more maintainable tests, lower overhead when changing tests for new/updated models, and an automatic canary test for new factory methods in torchvision.models.
Looking ever further forward, our model offerings are likely to continue to grow, and (I hope) we'll continue building out the test suite, both to cover new models and give better coverage on existing ones. Keeping an increasingly complex test suite for a growing body of code under test in a single class governed by decision branches will turn ugly. The new scheme should have more extensibility and a lower cost of change.
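To make the subclass-per-family structure concrete, a minimal sketch under assumed names (the tester classes, the helper method, and the num_classes value are illustrative, not the PR's actual code):

import unittest
import torch
import torchvision.models as models

class ClassificationModelTester(unittest.TestCase):
    # Shared defaults; subclasses override only what differs for their family.
    INPUT_SHAPE = (1, 3, 224, 224)
    NUM_CLASSES = 50

    def _check_classification_output(self, model):
        model.eval()
        with torch.no_grad():
            out = model(torch.rand(self.INPUT_SHAPE))
        self.assertEqual(out.shape, (self.INPUT_SHAPE[0], self.NUM_CLASSES))

class ResnetTester(ClassificationModelTester):
    def test_resnet18(self):
        self._check_classification_output(models.resnet18(num_classes=self.NUM_CLASSES))

class InceptionTester(ClassificationModelTester):
    # Inception needs larger inputs; the override lives here rather than in
    # a conditional branch of a shared omnibus test.
    INPUT_SHAPE = (1, 3, 299, 299)

    def test_inception_v3(self):
        self._check_classification_output(
            models.inception_v3(num_classes=self.NUM_CLASSES, aux_logits=False)
        )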
Generally I think it's best to avoid huge refactorings before they're clearly needed. In this case I'm not sure the need has been demonstrated. While the situation you're describing of increasingly complex conditionals seems plausible, it isn't the case today, and I think the updated version here is a substantial increase to the complexity of the code.
This adds:
- caching of model inputs per dimensionality
- introspection of test coverage
- two separate pathways for testing scriptability
- two separate pathways for adding model tests
- roughly four times as much code in this test file
I don't think it's likely that someone adds a new model and it ends up with no testing. If someone adds a new model, like memory-efficient DenseNet, they're expected to add a new test for it. I don't think this is a difficult standard to uphold as a contributor or reviewer.
This refactoring was needed in order to work with the tests in isolation. Those implicit, unintentional tests were a PITA to isolate and change, which limits their utility. Also, the correctness checks themselves are an example of the building complexity - they are sometimes slightly different for different models, e.g. checking that the memory-efficient DenseNet implementation shows the same output as the vanilla version. To your other points:
- I'm not married to the test input caching, and if there's a demonstrable issue that I'm not aware of (as opposed to merely a prospective one), I'm happy to change it.
- The introspection was already happening. I only included it so we wouldn't lose the sense of security people seem to get from the implicit/unintentional tests being generated over the factory methods.
- If by "two separate pathways" you mean the difference between classification and other model types, that's true. My brief only covered Classification models. I think the other model types should be broken out into their own files (as their testing needs will likely differ), and updated to dump the opaque generated tests.
- Yes, there is more code, but it's also modularized in a way that explicitly maps the tests to the code under test, removes conditionals in favor of OO, and makes it trivially obvious to find where to make a change or set a breakpoint for a particular model with a text search. "Less code" is not, in every situation, the best variable to optimize for.
Your last point - about tests being enforced by process - is orthogonal to the point I made above. If someone adds a model, never looks at the (implicit, unintentional) tests, and the automated tests all show green, the likelihood is against either the dev or a reviewer going deeper. The generated tests leave that door open. The new tests do nothing if the dev codes their tests first, and give a nudge if they don't.
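As an illustration of that kind of model-specific check, here is a hedged sketch comparing the memory-efficient DenseNet variant against the default one; the seeding trick and the tolerances are assumptions, and the PR may instead share weights via a state_dict copy:

import torch
import torchvision.models as models

def check_memory_efficient_densenet(num_classes=50, rtol=1e-2, atol=1e-2):
    # Seed before each construction so both variants start from identical
    # random weights (the memory_efficient flag only changes the forward
    # pass, not the parameter layout).
    torch.manual_seed(0)
    vanilla = models.densenet121(num_classes=num_classes).eval()
    torch.manual_seed(0)
    me_model = models.densenet121(num_classes=num_classes, memory_efficient=True).eval()

    x = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        assert torch.allclose(vanilla(x), me_model(x), rtol=rtol, atol=atol)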
This PR contains 2 major changes: a big refactor + correctness tests; these should probably be split into 2 PRs. The correctness tests could be added with a much simpler framework around them to make them run, and separating out the refactoring changes would make it easier to review.
Per conversation with @fmassa, I'm merging this.
This PR provides correctness tests and scriptability checks for all classification models in TorchVision, including Alexnet, Inception, Squeezenet, MobileNet, Googlenet, VGG, MNASNet, Densenet, Shufflenet, and Resnet (as well as variations of some of these, such as memory-efficient Densenet).
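For reference, the scriptability side of such checks typically reduces to something like the following sketch; the helper name, input shape, and tolerances are illustrative assumptions:

import torch
import torchvision.models as models

def check_scriptable(model_fn, input_shape=(1, 3, 224, 224), num_classes=50):
    # Script the model with TorchScript and confirm the scripted module
    # agrees with eager execution on the same random input.
    model = model_fn(num_classes=num_classes).eval()
    scripted = torch.jit.script(model)
    x = torch.rand(input_shape)
    with torch.no_grad():
        assert torch.allclose(model(x), scripted(x), rtol=1e-5, atol=1e-5)

check_scriptable(models.resnet18)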