Improve testing for models #1234

Open · fmassa opened this issue Aug 12, 2019 · 15 comments
@fmassa
Member

fmassa commented Aug 12, 2019

We currently have very limited testing for the models in torchvision.

In most cases, we only test that the output shape is correct, see

def _test_classification_model(self, name, input_shape):
    # passing num_class equal to a number other than 1000 helps in making the test
    # more enforcing in nature
    model = models.__dict__[name](num_classes=50)
    model.eval()
    x = torch.rand(input_shape)
    out = model(x)
    self.assertEqual(out.shape[-1], 50)

def _test_segmentation_model(self, name):
    # passing num_class equal to a number other than 1000 helps in making the test
    # more enforcing in nature
    model = models.segmentation.__dict__[name](num_classes=50, pretrained_backbone=False)
    model.eval()
    input_shape = (1, 3, 300, 300)
    x = torch.rand(input_shape)
    out = model(x)
    self.assertEqual(tuple(out["out"].shape), (1, 50, 300, 300))

def _test_detection_model(self, name):
    model = models.detection.__dict__[name](num_classes=50, pretrained_backbone=False)
    model.eval()
    input_shape = (3, 300, 300)
    x = torch.rand(input_shape)
    model_input = [x]
    out = model(model_input)
    self.assertIs(model_input[0], x)
    self.assertEqual(len(out), 1)
    self.assertTrue("boxes" in out[0])
    self.assertTrue("scores" in out[0])
    self.assertTrue("labels" in out[0])

While this allows us to catch bugs such as Py2/Py3 differences, it is not enough to ensure that refactorings of the models keep the behavior the same (and that pre-trained weights remain valid).

We should come up with a better testing strategy.

Ideally, we would not want to download the pre-trained weights, because this can make the tests prone to failure due to IO issues.

One possible approach would be to fix the random seed, initialize the models in a particular way, and compare the output we get for a particular input with an expected output, ensuring that the output is non-trivial (i.e., not all-zeros or empty, which could happen due to ReLU or for detection models). Something along the lines of

import torch
import torchvision

# fix the seed so that both the weight initialization and the input are reproducible
torch.manual_seed(42)
m = torchvision.models.resnet18(num_classes=2)
torch.manual_seed(42)
i = torch.rand(2, 3, 224, 224)
output = m(i)
# the expected values are only illustrative
expected_output = torch.tensor([[0.0002, 0.3], [0.245, 0.001]])
assert torch.allclose(output, expected_output, atol=1e-4)

The problem with this approach is that the RNG is not guaranteed to stay the same across PyTorch versions, which means that this way of testing the models would start failing if PyTorch's RNG changes.

It will be particularly important to address this soon, because we will be adding some slight modifications to the model implementations to make them traceable / ONNX-exportable. @lara-hdr is currently looking into making the torchvision models ONNX-ready.

cc @fbbradheintz , who showed interest in addressing this issue.

@fbbradheintz
Contributor

I'm on it, @fmassa - I'll post a comment here with a plan.

fbbradheintz self-assigned this Aug 14, 2019
@fbbradheintz
Contributor

My current plan:

  • Introduce fixed seeds for all pseudo-random processes.
  • Provide known inputs to pseudo-randomly initialized models. (With the fixed seeds, this should have the same effect as using a set of fixed weights.)
  • Pass input to the model.
  • Check a sampling of output values from the output tensor - enough to provide high statistical confidence that the whole output is correct without making the test long-running - to verify that they are within some epsilon of a target value.

This process will be repeated for all models covered by the test, probably in a subclass that specializes the initialization, input, and output based on the needs of the model.
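
A minimal sketch of what a single check could look like for a classification model, assuming the seeding strategy pans out (the helper name and the expected values are placeholders, not real recorded targets):

import torch
import torchvision

def check_classification_output(model_fn, expected_slice, seed=42, eps=1e-4):
    # Seed every pseudo-random process so that both the weight initialization
    # and the input are reproducible.
    torch.manual_seed(seed)
    model = model_fn(num_classes=50)
    model.eval()

    torch.manual_seed(seed)
    x = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        out = model(x)

    # Compare only a small sample of the output (here, the first few logits)
    # against recorded target values, within some epsilon, instead of the
    # full tensor.
    assert torch.allclose(out[0, :expected_slice.numel()], expected_slice, atol=eps)

# Hypothetical usage; the expected values below are placeholders.
# check_classification_output(torchvision.models.resnet18,
#                             torch.tensor([0.1234, -0.5678, 0.9012]))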

One possible pitfall is PRNGs giving different results for the same seed on different platforms or stdlib versions. I'll try to test a cross-section of build environments to see whether this is the case. If it is, the fallback will probably be some other algorithmic generation of weights that provides a reasonable distribution of values (perhaps based on the statistical distribution of values in the pretrained versions of the models).

@fmassa
Member Author

fmassa commented Aug 26, 2019

@fbbradheintz This sounds good to me!

The PRNGs are going to be different between CPU and GPU for sure, but I'm not sure whether they will also differ between platforms -- to be tested.

This will at least be a good starting point, and if we make the model initialization its own method in a tester class, then we can change the initialization of all models with a single function change.
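
A rough sketch of what that structure could look like (illustrative names only, not the actual test suite):

import unittest
import torch
import torchvision

class ModelTester(unittest.TestCase):
    def _init_model(self, model_fn, **kwargs):
        # Single place that controls how every tested model is initialized;
        # changing the strategy here changes it for all model tests at once.
        torch.manual_seed(42)
        return model_fn(**kwargs).eval()

    def test_resnet18(self):
        model = self._init_model(torchvision.models.resnet18, num_classes=50)
        out = model(torch.rand(1, 3, 224, 224))
        self.assertEqual(out.shape[-1], 50)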

@SebastienEske

Just an idea, but if you could select the RNG in torch.random, and maybe have a basic one for testing purposes, that would solve the problem, no? Since it's for testing, it doesn't even need to be very random.
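
To illustrate the idea (just a sketch; there is no such pluggable RNG in torch.random today): the weights could even be filled with a fixed, PRNG-free pattern, which sidesteps the seed-stability question entirely:

import torch
import torchvision

def deterministic_init(model):
    # Overwrite every parameter with a fixed, PRNG-free pattern so the weights
    # are identical across PyTorch versions and platforms. Only meant for
    # testing: the values just need to be non-trivial, not well distributed.
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(torch.linspace(-0.1, 0.1, p.numel()).reshape(p.shape))
    return model

model = deterministic_init(torchvision.models.resnet18(num_classes=2)).eval()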

@fmassa
Member Author

fmassa commented Aug 30, 2019

@SebastienEske yes, that's an idea that could work. It might be a bit more complicated to make it work for detection models (which have lots of conditional and dynamic blocks), though, as we would want a non-trivial output to be produced (e.g., one with some non-empty boxes).
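
For reference, the kind of non-trivial check a detection model would need looks roughly like this (a sketch; with randomly initialized weights the boxes may well come back empty, which is exactly the difficulty mentioned above):

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    num_classes=50, pretrained=False, pretrained_backbone=False)
model.eval()

with torch.no_grad():
    out = model([torch.rand(3, 300, 300)])

# A trivially empty prediction would make any value comparison meaningless,
# so the test should first assert that some boxes actually survive.
assert out[0]["boxes"].shape[0] > 0, "detection output is trivially empty"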

@pmeier
Collaborator

pmeier commented Sep 12, 2019

Playing the devil's advocate here:

Have you considered the non-determinism that some functions in the models might exhibit? These effects can occur even when the same operation is performed twice in a row on the same system, with nothing changing in between. Thus, simply testing for the same output could be flaky.

For more information on the non-determinism see this talk. It is mainly centered around TensorFlow, but there is a small segment starting at 32:31 about PyTorch.

The author states:

I hear that some ops may still be non-deterministic

which led me to try to confirm this. Ultimately, I "reproduced" the non-determinism and asked a question about it in the official forum. In short:

You cannot eliminate non-determinism even if you follow the official guidelines for reproducibility.
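
For reference, those guidelines boil down to roughly the following (a sketch based on the PyTorch reproducibility notes of the time; as stated above, even this does not make every op deterministic):

import random
import numpy as np
import torch

def seed_everything(seed=42):
    # Seed every PRNG in play and force cuDNN into deterministic mode, as the
    # reproducibility notes recommend. Some ops remain non-deterministic anyway.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False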

@fmassa
Member Author

fmassa commented Sep 13, 2019

The non-determinism is a real complication, and I believe @fbbradheintz has hit some of those non-determinisms with a few models.

@fbbradheintz
Contributor

I've seen it with the Inception v3 model for sure. The rest seem well behaved when using default constructors and a seeded PRNG.

Once I have the deterministic models sorted out, I'll circle back to Inception and any others that need attention.

@t-vi
Contributor

t-vi commented Sep 24, 2019

Just in case: for the forward passes (which seem most relevant for tracing), nll_loss2d and scatter_* are the prime suspects for non-determinism.

@fbbradheintz
Contributor

Just to follow up on the non-determinism thread: The culprit was the numpy PRNG. This is sorted out in the draft PR.

@pmeier
Collaborator

pmeier commented Oct 10, 2019

#1424 for cross reference.

@fbbradheintz
Contributor

@fmassa - my last PR covered the classification models. Do we want to handle video, detection, and segmentation as well before we call this complete?

@fmassa
Member Author

fmassa commented Oct 29, 2019

@fbbradheintz yes, we would like to have all models covered by correctness checks if possible.

@fbbradheintz
Contributor

@fmassa - There are currently multiple test failures (mostly related to scriptability) for segmentation, detection and video models. I'll watch for those to clear up before I start building on them (unless you have a suggestion for a branch besides master to build on).

There are some other passing tests that still need correctness tests added, I'll address them in the meantime.

@fmassa
Member Author

fmassa commented Nov 18, 2019

@fbbradheintz CI should be green now
