Improve testing for models #1234

Open · fmassa opened this issue Aug 12, 2019 · 15 comments
@fmassa
Member

fmassa commented Aug 12, 2019

We currently have very limited testing for the models in torchvision.

In most cases, we only test that the output shape is correct, see

def _test_classification_model(self, name, input_shape):
    # passing num_class equal to a number other than 1000 helps in making the test
    # more enforcing in nature
    model = models.__dict__[name](num_classes=50)
    model.eval()
    x = torch.rand(input_shape)
    out = model(x)
    self.assertEqual(out.shape[-1], 50)

def _test_segmentation_model(self, name):
    # passing num_class equal to a number other than 1000 helps in making the test
    # more enforcing in nature
    model = models.segmentation.__dict__[name](num_classes=50, pretrained_backbone=False)
    model.eval()
    input_shape = (1, 3, 300, 300)
    x = torch.rand(input_shape)
    out = model(x)
    self.assertEqual(tuple(out["out"].shape), (1, 50, 300, 300))

def _test_detection_model(self, name):
    model = models.detection.__dict__[name](num_classes=50, pretrained_backbone=False)
    model.eval()
    input_shape = (3, 300, 300)
    x = torch.rand(input_shape)
    model_input = [x]
    out = model(model_input)
    self.assertIs(model_input[0], x)
    self.assertEqual(len(out), 1)
    self.assertTrue("boxes" in out[0])
    self.assertTrue("scores" in out[0])
    self.assertTrue("labels" in out[0])

While this allows us to catch bugs such as Py2/Py3 differences, it is not enough to ensure that refactorings of the models keep the behavior the same (and that pre-trained weights remain valid).

We should come up with a better testing strategy.

Ideally, we would not want to download the pre-trained weights, because this can make the tests prone to failure due to IO issues.

One possible approach would be to fix the random seed, initialize the models in a particular way, and compare the output we get for a particular input with an expected output, ensuring that the output is non-trivial (i.e., not all-zeros or empty, which could happen due to ReLU or for detection models). Something along the lines of

import torch
import torchvision

# fix the seed so that both the weight initialization and the input are reproducible
torch.manual_seed(42)
m = torchvision.models.resnet18(num_classes=2)
torch.manual_seed(42)
i = torch.rand(2, 3, 224, 224)
output = m(i)
# the expected values are only illustrative
expected_output = torch.tensor([[0.0002, 0.3], [0.245, 0.001]])
assert torch.allclose(output, expected_output, atol=1e-4)

The problem with this approach is that the RNG is not guaranteed to stay the same across PyTorch versions, which means that this way of testing the models would start failing if PyTorch's RNG changes.

It will be particularly important to address this soon, because we will be adding some slight modifications to the model implementations to make them traceable / ONNX-exportable. @lara-hdr is currently looking into making the torchvision models ONNX-ready.

cc @fbbradheintz , who showed interest in addressing this issue.

@fbbradheintz
Contributor

I'm on it, @fmassa - I'll post a comment here with a plan.

fbbradheintz self-assigned this Aug 14, 2019
@fbbradheintz
Contributor

My current plan:

  • Introduce fixed seeds for all pseudo-random processes.
  • Provide known inputs to pseudo-randomly initialized models. (With the fixed seeds, this should have the same effect as using a set of fixed weights.)
  • Pass input to the model.
  • Check a sampling of output values from the output tensor - enough to provide high statistical confidence that the whole output is correct without making the test long-running - to verify that they are within some epsilon of a target value.

This process will be repeated for all models covered by the test, probably in a subclass that specializes the initialization, input, and output based on the needs of the model.
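
A minimal sketch of what a single check could look like for a classification model, assuming the seeding strategy pans out (the helper name and the expected values are placeholders, not real recorded targets):

import torch
import torchvision

def check_classification_output(model_fn, expected_slice, seed=42, eps=1e-4):
    # Seed every pseudo-random process so that both the weight initialization
    # and the input are reproducible.
    torch.manual_seed(seed)
    model = model_fn(num_classes=50)
    model.eval()

    torch.manual_seed(seed)
    x = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        out = model(x)

    # Compare only a small sample of the output (here, the first few logits)
    # against recorded target values, within some epsilon, instead of the
    # full tensor.
    assert torch.allclose(out[0, :expected_slice.numel()], expected_slice, atol=eps)

# Hypothetical usage; the expected values below are placeholders.
# check_classification_output(torchvision.models.resnet18,
#                             torch.tensor([0.1234, -0.5678, 0.9012]))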

One possible pitfall is PRNGs giving different results for the same seed on different platforms or stdlib versions. I'll try to test a cross-section of build environments to see whether this is the case. If it is, the fallback will probably be some other algorithmic generation of weights that provides a reasonable distribution of values (perhaps based on the statistical distribution of values in the pretrained versions of the models).

@fmassa
Member Author

fmassa commented Aug 26, 2019

@fbbradheintz This sounds good to me!

The PRNGs are going to be different between CPU and GPU for sure, but I'm not sure whether they will also differ between platforms -- to be tested.

This will at least be a good starting point, and if we make the model initialization its own method in a tester class, then we can change the initialization of all models with a single function change.
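
A rough sketch of what that structure could look like (illustrative names only, not the actual test suite):

import unittest
import torch
import torchvision

class ModelTester(unittest.TestCase):
    def _init_model(self, model_fn, **kwargs):
        # Single place that controls how every tested model is initialized;
        # changing the strategy here changes it for all model tests at once.
        torch.manual_seed(42)
        return model_fn(**kwargs).eval()

    def test_resnet18(self):
        model = self._init_model(torchvision.models.resnet18, num_classes=50)
        out = model(torch.rand(1, 3, 224, 224))
        self.assertEqual(out.shape[-1], 50)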

@SebastienEske

Just an idea, but if you could select the RNG in torch.random, and maybe have a basic one for testing purposes, that would solve the problem, no? Since it's for testing, it doesn't even need to be very random.
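
To illustrate the idea (just a sketch; there is no such pluggable RNG in torch.random today): the weights could even be filled with a fixed, PRNG-free pattern, which sidesteps the seed-stability question entirely:

import torch
import torchvision

def deterministic_init(model):
    # Overwrite every parameter with a fixed, PRNG-free pattern so the weights
    # are identical across PyTorch versions and platforms. Only meant for
    # testing: the values just need to be non-trivial, not well distributed.
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(torch.linspace(-0.1, 0.1, p.numel()).reshape(p.shape))
    return model

model = deterministic_init(torchvision.models.resnet18(num_classes=2)).eval()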

@fmassa
Member Author

fmassa commented Aug 30, 2019

@SebastienEske yes, that's an idea that could work. It might be a bit more complicated to make it work for detection models (which have lots of conditional and dynamic blocks), though, as we would want a non-trivial output to be produced (e.g., one with some non-empty boxes).
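
For reference, the kind of non-trivial check a detection model would need looks roughly like this (a sketch; with randomly initialized weights the boxes may well come back empty, which is exactly the difficulty mentioned above):

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    num_classes=50, pretrained=False, pretrained_backbone=False)
model.eval()

with torch.no_grad():
    out = model([torch.rand(3, 300, 300)])

# A trivially empty prediction would make any value comparison meaningless,
# so the test should first assert that some boxes actually survive.
assert out[0]["boxes"].shape[0] > 0, "detection output is trivially empty"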

@pmeier
Collaborator

pmeier commented Sep 12, 2019

Playing the devil's advocate here:

Have you considered the non-determinism that some functions in the models might exhibit? These effects can occur even when the same operation is performed twice in a row on the same system, with nothing changing in between. Thus, simply testing for the same output could be flaky.

For more information on the non-determinism see this talk. It is mainly centered around TensorFlow, but there is a small segment starting at 32:31 about PyTorch.

The author states:

I hear that some ops may still be non-deterministic

which led me to try to confirm this. Ultimately, I "reproduced" the non-determinism and asked a question about it in the official forum. In short:

You cannot eliminate non-determinism even if you follow the official guidelines for reproducibility.
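
For reference, those guidelines boil down to roughly the following (a sketch based on the PyTorch reproducibility notes of the time; as stated above, even this does not make every op deterministic):

import random
import numpy as np
import torch

def seed_everything(seed=42):
    # Seed every PRNG in play and force cuDNN into deterministic mode, as the
    # reproducibility notes recommend. Some ops remain non-deterministic anyway.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False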

@fmassa
Member Author

fmassa commented Sep 13, 2019

The non-determinism is a real complication, and I believe @fbbradheintz has hit some of those non-determinisms with a few models.

@fbbradheintz
Contributor

I've seen it with the Inception v3 model for sure. The rest seem well behaved when using default constructors and a seeded PRNG.

Once I have the deterministic models sorted out, I'll circle back to Inception and any others that need attention.

@t-vi
Contributor

t-vi commented Sep 24, 2019

Just in case: for the forward passes (which seem most relevant for tracing), nll_loss2d and scatter_* are the prime suspects for non-determinism.

@fbbradheintz
Contributor

Just to follow up on the non-determinism thread: The culprit was the numpy PRNG. This is sorted out in the draft PR.

@pmeier
Collaborator

pmeier commented Oct 10, 2019

#1424 for cross reference.

@fbbradheintz
Contributor

@fmassa - my last PR covered the classification models. Do we want to handle video, detection, and segmentation as well before we call this complete?

@fmassa
Member Author

fmassa commented Oct 29, 2019

@fbbradheintz yes, we would like to have all models covered by correctness checks if possible.

@fbbradheintz
Contributor

@fmassa - There are currently multiple test failures (mostly related to scriptability) for segmentation, detection and video models. I'll watch for those to clear up before I start building on them (unless you have a suggestion for a branch besides master to build on).

There are some other passing tests that still need correctness tests added, I'll address them in the meantime.

@fmassa
Member Author

fmassa commented Nov 18, 2019

@fbbradheintz CI should be green now
