
Receptive field in CNN-based architecture, and more about the usage of conv blocks #50

Closed
yluo42 opened this issue May 2, 2019 · 14 comments

yluo42 commented May 2, 2019

Thanks for the very interesting paper and the implementation - fabulous work!

I just have a question about the usage of conv blocks in the model. Please correct me if I'm wrong - it seems that a neural ODE with a single conv block leads to an infinite receptive field even for a very short integration time, whereas for discrete ODE-based designs such as Euler Net and Runge-Kutta Net the receptive fields are finite and depend on the number of layers/blocks. If so, (1) it seems that a conv block degenerates into an FC layer applied to the entire (flattened) input, since the concept of "receptive field" no longer holds in this case, and (2) there seems to be no need for a larger/deeper model in a neural ODE if the goal is to cover a large, possibly global receptive field - a single block (perhaps together with a HyperNet) should be enough. I'm not sure whether this still holds for larger models such as ResNet50/ResNet101 on larger datasets, but my intuition is that a single conv-block ODE would have a hard time reaching on-par performance with them (e.g. the comment in #32 about performance on CIFAR10). So I'm also wondering whether you have run numerical experiments on larger datasets comparing neural ODEs with larger, especially deeper, models.
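
For concreteness, here is a small probe I would use to check this (my own sketch, not code from this repo): apply a single 3x3 conv once per explicit Euler step and watch the receptive field grow from a delta impulse. It widens by 2 pixels per evaluation, so it is tied to the number of solver steps rather than to the integration interval itself.

import torch
import torch.nn as nn

# Hypothetical probe: one explicit Euler step per iteration with a single 3x3
# conv as f(h); the nonzero support around a delta impulse grows by 2 per step.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
nn.init.ones_(conv.weight)

h = torch.zeros(1, 1, 33, 33)
h[0, 0, 16, 16] = 1.0  # delta impulse at the centre
dt = 0.1

with torch.no_grad():
    for step in range(1, 6):
        h = h + dt * conv(h)  # Euler step: h <- h + dt * f(h)
        width = (h[0, 0] != 0).any(dim=0).sum().item()
        print(f"step {step}: nonzero width = {width}")  # 3, 5, 7, 9, 11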

Thanks in advance!

yluo42 commented May 2, 2019

As a follow-up, I tried a neural ODE in a denoising autoencoder (DAE) architecture on audio clips (my focus is mainly on audio processing tasks). I use a ResNet-style encoder to map the noisy input to a latent space and another ResNet-style decoder to reconstruct the clean input. I tried two architectures: one with standard training (input -> Enc -> Dec -> output), and one with an extra ODE block between the encoder and the decoder (input -> Enc -> ODE block -> Dec -> output). The ODE block was integrated over time [0, 1] and only the final state was used for training. The intuition is that the ODE block should learn a dynamic in the latent space that moves the noisy features toward the clean features for reconstruction, and I wanted to see whether this is trainable in such a deep architecture. Both the encoder and the decoder consisted of 8 1-D CNN blocks with per-layer residual connections, the input to the model is the raw waveform of the noisy audio clips (no STFT), and the ODE block contained 2 1-D CNN blocks similar to the MNIST example script.
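
Roughly, the ODE-block variant looks like the sketch below (module and variable names here are placeholders, not my actual code); the encoder output is used as the initial state and only the solution at t = 1 is passed to the decoder:

import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint

class ODEFunc(nn.Module):
    """Placeholder dynamics f(t, h): two 1-D conv blocks on the latent sequence."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, t, h):
        return self.net(h)

class ODEBlock(nn.Module):
    """Integrate f over [0, 1] and keep only the final state."""
    def __init__(self, func):
        super().__init__()
        self.func = func
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, h):
        return odeint(self.func, h, self.t)[-1]

# model = nn.Sequential(encoder, ODEBlock(ODEFunc(latent_channels)), decoder)
# where `encoder` and `decoder` stand for the 8-block 1-D ResNet-style stacks.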

I ran both configurations 5 times and chose the best results on a small dataset (2 hrs of audio). My observations were that not only did the standard model (without the ODE block) significantly outperform the ODE one on the SNR metric (by around 170%), but the ODE one was also very sensitive to initialization and learning rate. Moreover, adding the ODE block slowed training down by about 6x (with the adjoint method). This experiment may not be representative of all possible problem settings and tasks, but at least my trials do not show that a neural ODE block is always effective and can easily replace residual blocks in deep architectures (especially between deep sub-modules in a system?). I'm curious about observations on different tasks/models/datasets, to see whether I'm the only one having this issue.

rafaelvalle commented May 3, 2019

Did you try using a different non-linearity and norm on the Neural ODE itself?
Did you check the norm of the gradients on the encoder with the Neural ODE? (Something like the snippet below would do.)
Did you consider training the NODE starting from the previously trained encoder and decoder that work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...
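
For the gradient check, something like this (a hypothetical helper, assuming a standard PyTorch module) after loss.backward() would be enough:

# Print per-parameter gradient norms to check whether the encoder's gradients
# vanish once the NODE is inserted.
def print_grad_norms(model):
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name}: {p.grad.norm().item():.3e}")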

yluo42 commented May 3, 2019

Did you try using a different non-linearity and norm on the Neural ODE itself?

I tried ReLU/PReLU/Tanh and didn't observe much difference. I can try more learning rates, but I'm not very optimistic about it.

Did you check the norm of the gradients on the encoder with the Neural ODE?

I haven't, but I always apply gradient clipping so it might not be the main issue? I should take a look at that though.

Did you consider training the NODE starting from the previously trained encoder and decoder that work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...

I can definitely try that. Why would the first iteration of the NODE be close to the identity? It doesn't look true to me.

rafaelvalle commented May 3, 2019

Check the norm of the gradients just to confirm they are not vanishing, even though they shouldn't be.
The identity comment was my mistake: it should not be the identity, it should be closer to a linear transformation.

yluo42 commented May 3, 2019

The gradients in the ODE block are on the same scale as those in the encoder/decoder (and not vanishing), so that's probably not the reason. Let me try something like concatenating t into the input to make it non-autonomous and see what happens.

Beyond the model itself, I'm more interested in the "infinite receptive field" argument, since that sometimes plays an important role in long-range sequence modeling tasks. I would really like to see what role conv blocks play compared with FC layers in this situation.

rafaelvalle commented May 3, 2019

concatenating t into the input to make it non-autonomous and see what happens.

Concatenating t into the input of what?

yluo42 commented May 3, 2019

To the input of the conv layers in the ODE block, like they do in the MNIST example (and the FFJORD project):

import torch
import torch.nn as nn

class ConcatConv2d(nn.Module):
    """Conv layer that concatenates the scalar time t as an extra input channel."""

    def __init__(self, dim_in, dim_out, ksize=3, stride=1, padding=0,
                 dilation=1, groups=1, bias=True, transpose=False):
        super(ConcatConv2d, self).__init__()
        module = nn.ConvTranspose2d if transpose else nn.Conv2d
        self._layer = module(
            dim_in + 1, dim_out, kernel_size=ksize, stride=stride,
            padding=padding, dilation=dilation, groups=groups, bias=bias
        )

    def forward(self, t, x):
        # Broadcast t to a constant feature map and concatenate it along channels.
        tt = torch.ones_like(x[:, :1, :, :]) * t
        ttx = torch.cat([tt, x], 1)
        return self._layer(ttx)
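
Since my blocks are 1-D (raw waveform input), the analogous layer would simply swap in nn.Conv1d; a sketch of my adaptation (not from the example script, using the same imports as above):

class ConcatConv1d(nn.Module):
    def __init__(self, dim_in, dim_out, ksize=3, stride=1, padding=0,
                 dilation=1, groups=1, bias=True):
        super().__init__()
        self._layer = nn.Conv1d(dim_in + 1, dim_out, kernel_size=ksize,
                                stride=stride, padding=padding,
                                dilation=dilation, groups=groups, bias=bias)

    def forward(self, t, x):
        # Broadcast t over the time axis and concatenate it as an extra channel.
        tt = torch.ones_like(x[:, :1, :]) * t
        return self._layer(torch.cat([tt, x], dim=1))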

rafaelvalle commented May 3, 2019

Surprised to hear that the convs in your ODE block did not have t as input.

yluo42 commented May 3, 2019

Mainly because in a previous (closed) issue the author mentioned that, for MNIST, training with or without t didn't make a difference, so I went without it for this small dataset.

rafaelvalle commented May 3, 2019

What is that issue number? Can you share a link?

yluo42 commented May 3, 2019

Sure, it's in #14 (comment).

rtqichen commented May 3, 2019

I don't think the receptive field is a clear indicator of a model's performance, or we would all be using convolution layers larger than 3x3. You're right that the effective receptive field would (theoretically) be infinite, but in practice it depends on the number of actual evaluations. (This is an interesting observation!) Though I don't think it degenerates into an FC block, as an (infinite) series of stacked convolutional layers still uses locality assumptions and should not have the same degrees of freedom as a true FC layer (though I'm not entirely sure). Perhaps a more meaningful measure would relate to the complexity of the transformation, rather than just the locality of the dependencies.

An ODEBlock's behavior is quite sensitive to the topology of the underlying hidden space, and a "difficult-to-navigate" space requires a complex (and slow to solve) ODE. A recent work (https://arxiv.org/abs/1904.01681) showed that simply padding zeros and using an ODE in the pixel space is sufficient for classification, but in general having a good initial hidden space should help significantly, especially for tasks where the output is higher-dimensional.
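
Roughly, the padding trick amounts to something like the following (a paraphrase of the idea, not the paper's code):

import torch

# Append extra zero channels to the state before solving the ODE, giving the
# dynamics room to move trajectories past each other in the larger space.
def augment(x, extra_channels=5):
    pad = torch.zeros(x.size(0), extra_channels, *x.shape[2:],
                      dtype=x.dtype, device=x.device)
    return torch.cat([x, pad], dim=1)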

I'm not sure why adding an ODEBlock to your autoencoder actually made performance worse for you. Most tricks for initializing residual nets (like zeroing the weights of the last layer) should help for ODEs as well. This will initialize the ODE as an identity.
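
Concretely, that amounts to something like this (a sketch, assuming the ODE function ends with a conv layer):

import torch.nn as nn

# Zero the last layer of f so that dh/dt = 0 at initialization, which makes the
# ODE block an identity map at the start of training.
def zero_init_last_layer(odefunc):
    last = odefunc.net[-1]  # assumes f is an nn.Sequential ending in a conv
    nn.init.zeros_(last.weight)
    if last.bias is not None:
        nn.init.zeros_(last.bias)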

rafaelvalle commented May 3, 2019

Thanks for bringing up zero initialization as a means of making the ODE an identity!

yluo42 commented May 3, 2019

Thanks for the thorough reply! I don't think the receptive field is an indicator of performance either; I'm just curious about how it would affect the feature extraction process. For example, could it sometimes even be harmful for tasks that strictly require locality?

And thanks for pointing out the recent paper - I skimmed through it and it's really interesting. The main idea seems similar to applying a kernel to the feature space so that the dynamics become easier to learn, as in standard kernel methods, and other (nonlinear) kernels might show similar behavior at the cost of more model complexity than this simple zero-padding scheme. I think my case is pretty similar - the dynamics in the latent space might be too hard for a simple ODEBlock to learn. I'll play with the initialization/design/problem formulation to see if it can be improved.

@yluo42 yluo42 closed this as completed May 3, 2019