
Receptive field in CNN-based architecture, and more about the usage of conv blocks #50

Closed
yluo42 opened this issue May 2, 2019 · 14 comments

yluo42 commented May 2, 2019

Thanks for the very interesting paper and the implementation - fabulous work!

I just have a question about the usage of conv blocks in the model. Please correct me if I'm wrong - it seems that a neural ODE with a single conv block leads to an infinite receptive field even for a very short integration time, whereas for discrete ODE-based designs such as Euler Net and Runge-Kutta Net the receptive fields are finite and depend on the number of layers/blocks. If so, (1) it seems that a conv block degenerates into an FC layer applied to the entire (flattened) input, since the concept of "receptive field" no longer holds in this case, and (2) there seems to be no need for a larger/deeper model in a neural ODE if the goal is to cover a large, possibly global receptive field - a single block (perhaps together with a HyperNet) should be enough. I'm not sure whether this still holds for larger models such as ResNet50/ResNet101 on larger datasets, but my intuition is that a single conv-block ODE would have a hard time reaching on-par performance with them (e.g. the comment in #32 about performance on CIFAR10). So I'm also wondering whether you have run numerical experiments on larger datasets comparing neural ODEs with larger, especially deeper, models.
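
For concreteness, here is a small probe I would use to check this (my own sketch, not code from this repo): apply a single 3x3 conv once per explicit Euler step and watch the receptive field grow from a delta impulse. It widens by 2 pixels per evaluation, so it is tied to the number of solver steps rather than to the integration interval itself.

import torch
import torch.nn as nn

# Hypothetical probe: one explicit Euler step per iteration with a single 3x3
# conv as f(h); the nonzero support around a delta impulse grows by 2 per step.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
nn.init.ones_(conv.weight)

h = torch.zeros(1, 1, 33, 33)
h[0, 0, 16, 16] = 1.0  # delta impulse at the centre
dt = 0.1

with torch.no_grad():
    for step in range(1, 6):
        h = h + dt * conv(h)  # Euler step: h <- h + dt * f(h)
        width = (h[0, 0] != 0).any(dim=0).sum().item()
        print(f"step {step}: nonzero width = {width}")  # 3, 5, 7, 9, 11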

Thanks in advance!

yluo42 commented May 2, 2019

As a follow-up, I tried a neural ODE in a denoising autoencoder (DAE) architecture on audio clips (my focus is mainly on audio processing tasks). I use a ResNet-style encoder to map the noisy input to a latent space and another ResNet-style decoder to reconstruct the clean input. I tried two architectures: one with standard training (input -> Enc -> Dec -> output), and one with an extra ODE block between the encoder and the decoder (input -> Enc -> ODE block -> Dec -> output). The ODE block was integrated over time [0, 1] and only the final state was used for training. The intuition is that the ODE block should learn a dynamic in the latent space that moves the noisy features toward the clean features for reconstruction, and I wanted to see whether this is trainable in such a deep architecture. Both the encoder and the decoder consisted of 8 1-D CNN blocks with per-layer residual connections, the input to the model is the raw waveform of the noisy audio clips (no STFT), and the ODE block contained 2 1-D CNN blocks similar to the MNIST example script.
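
Roughly, the ODE-block variant looks like the sketch below (module and variable names here are placeholders, not my actual code); the encoder output is used as the initial state and only the solution at t = 1 is passed to the decoder:

import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint

class ODEFunc(nn.Module):
    """Placeholder dynamics f(t, h): two 1-D conv blocks on the latent sequence."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, t, h):
        return self.net(h)

class ODEBlock(nn.Module):
    """Integrate f over [0, 1] and keep only the final state."""
    def __init__(self, func):
        super().__init__()
        self.func = func
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, h):
        return odeint(self.func, h, self.t)[-1]

# model = nn.Sequential(encoder, ODEBlock(ODEFunc(latent_channels)), decoder)
# where `encoder` and `decoder` stand for the 8-block 1-D ResNet-style stacks.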

I ran both configurations 5 times and chose the best results on a small dataset (2 hrs of audio). My observations were that not only did the standard model (without the ODE block) significantly outperform the ODE one on the SNR metric (by around 170%), but the ODE one was also very sensitive to initialization and learning rate. Moreover, adding the ODE block slowed training down by about 6x (with the adjoint method). This experiment may not be representative of all possible problem settings and tasks, but at least my trials do not show that a neural ODE block is always effective and can easily replace residual blocks in deep architectures (especially between deep sub-modules in a system?). I'm curious about observations on different tasks/models/datasets, to see whether I'm the only one having this issue.

rafaelvalle commented May 3, 2019

Did you try using a different non-linearity and norm on the Neural ODE itself?
Did you check the norm of the gradients on the encoder with the Neural ODE? (Something like the snippet below would do.)
Did you consider training the NODE starting from the previously trained encoder and decoder that work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...
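
For the gradient check, something like this (a hypothetical helper, assuming a standard PyTorch module) after loss.backward() would be enough:

# Print per-parameter gradient norms to check whether the encoder's gradients
# vanish once the NODE is inserted.
def print_grad_norms(model):
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name}: {p.grad.norm().item():.3e}")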

yluo42 commented May 3, 2019

Did you try using a different non-linearity and norm on the Neural ODE itself?

I tried ReLU/PReLU/Tanh and didn't observe much difference. I can try more learning rates, but I'm not very optimistic about it.

Did you check the norm of the gradients on the encoder with the Neural ODE?

I haven't, but I always apply gradient clipping so it might not be the main issue? I should take a look at that though.

Did you consider training the NODE starting from the previously trained encoder and decoder that work better? IIRC at the first iteration the output of the NODE should just be close to the identity, so it shouldn't get worse than that...

I can definitely try that. Why would the first iteration of the NODE be close to the identity? It doesn't look true to me.

rafaelvalle commented May 3, 2019

Check the norm of the gradients just to confirm they are not vanishing, even though they shouldn't be.
The identity comment was my mistake: it should not be the identity, it should be closer to a linear transformation.

yluo42 commented May 3, 2019

The gradients in the ODE block are on the same scale as those in the encoder/decoder (and not vanishing), so that's probably not the reason. Let me try something like concatenating t into the input to make it non-autonomous and see what happens.

Beyond the model itself, I'm more interested in the "infinite receptive field" argument, since that sometimes plays an important role in long-range sequence modeling tasks. I would really like to see what role conv blocks play compared with FC layers in this situation.

rafaelvalle commented May 3, 2019

concatenating t into the input to make it non-autonomous and see what happens.

Concatenating t into the input of what?

yluo42 commented May 3, 2019

To the input of the conv layers in the ODE block, like they do in the MNIST example (and the FFJORD project):

import torch
import torch.nn as nn

class ConcatConv2d(nn.Module):
    """Conv layer that concatenates the scalar time t as an extra input channel."""

    def __init__(self, dim_in, dim_out, ksize=3, stride=1, padding=0,
                 dilation=1, groups=1, bias=True, transpose=False):
        super(ConcatConv2d, self).__init__()
        module = nn.ConvTranspose2d if transpose else nn.Conv2d
        self._layer = module(
            dim_in + 1, dim_out, kernel_size=ksize, stride=stride,
            padding=padding, dilation=dilation, groups=groups, bias=bias
        )

    def forward(self, t, x):
        # Broadcast t to a constant feature map and concatenate it along channels.
        tt = torch.ones_like(x[:, :1, :, :]) * t
        ttx = torch.cat([tt, x], 1)
        return self._layer(ttx)
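
Since my blocks are 1-D (raw waveform input), the analogous layer would simply swap in nn.Conv1d; a sketch of my adaptation (not from the example script, using the same imports as above):

class ConcatConv1d(nn.Module):
    def __init__(self, dim_in, dim_out, ksize=3, stride=1, padding=0,
                 dilation=1, groups=1, bias=True):
        super().__init__()
        self._layer = nn.Conv1d(dim_in + 1, dim_out, kernel_size=ksize,
                                stride=stride, padding=padding,
                                dilation=dilation, groups=groups, bias=bias)

    def forward(self, t, x):
        # Broadcast t over the time axis and concatenate it as an extra channel.
        tt = torch.ones_like(x[:, :1, :]) * t
        return self._layer(torch.cat([tt, x], dim=1))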

rafaelvalle commented May 3, 2019

Surprised to hear that the convs in your ODE block did not have t as input.

yluo42 commented May 3, 2019

Mainly because in a previous (closed) issue the author mentioned that, for MNIST, training with or without t didn't make a difference, so I went without it for this small dataset.

rafaelvalle commented May 3, 2019

What is that issue number? Can you share a link?

yluo42 commented May 3, 2019

Sure, it's in #14 (comment).

rtqichen commented May 3, 2019

I don't think the receptive field is a clear indicator of a model's performance, or we would all be using convolution layers larger than 3x3. You're right that the effective receptive field would (theoretically) be infinite, but in practice it depends on the number of actual evaluations. (This is an interesting observation!) Though I don't think it degenerates into an FC block, as an (infinite) series of stacked convolutional layers still uses locality assumptions and should not have the same degrees of freedom as a true FC layer (though I'm not entirely sure). Perhaps a more meaningful measure would relate to the complexity of the transformation, rather than just the locality of the dependencies.

An ODEBlock's behavior is quite sensitive to the topology of the underlying hidden space, and a "difficult-to-navigate" space requires a complex (and slow to solve) ODE. A recent work (https://arxiv.org/abs/1904.01681) showed that simply padding zeros and using an ODE in the pixel space is sufficient for classification, but in general having a good initial hidden space should help significantly, especially for tasks where the output is higher-dimensional.
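
Roughly, the padding trick amounts to something like the following (a paraphrase of the idea, not the paper's code):

import torch

# Append extra zero channels to the state before solving the ODE, giving the
# dynamics room to move trajectories past each other in the larger space.
def augment(x, extra_channels=5):
    pad = torch.zeros(x.size(0), extra_channels, *x.shape[2:],
                      dtype=x.dtype, device=x.device)
    return torch.cat([x, pad], dim=1)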

I'm not sure why adding an ODEBlock to your autoencoder actually made performance worse for you. Most tricks for initializing residual nets (like zeroing the weights of the last layer) should help for ODEs as well. This will initialize the ODE as an identity.
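
Concretely, that amounts to something like this (a sketch, assuming the ODE function ends with a conv layer):

import torch.nn as nn

# Zero the last layer of f so that dh/dt = 0 at initialization, which makes the
# ODE block an identity map at the start of training.
def zero_init_last_layer(odefunc):
    last = odefunc.net[-1]  # assumes f is an nn.Sequential ending in a conv
    nn.init.zeros_(last.weight)
    if last.bias is not None:
        nn.init.zeros_(last.bias)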

rafaelvalle commented May 3, 2019

Thanks for bringing up zero initialization as a means of making the ODE an identity!

yluo42 commented May 3, 2019

Thanks for the thorough reply! I don't think the receptive field is an indicator of performance either; I'm just curious about how it would affect the feature extraction process. For example, could it sometimes even be harmful for tasks that strictly require locality?

And thanks for pointing out the recent paper - I skimmed through it and it's really interesting. The main idea seems similar to applying a kernel to the feature space so that the dynamics become easier to learn, as in standard kernel methods, and other (nonlinear) kernels might show similar behavior at the cost of more model complexity than this simple zero-padding scheme. I think my case is pretty similar - the dynamics in the latent space might be too hard for a simple ODEBlock to learn. I'll play with the initialization/design/problem formulation to see if it can be improved.

@yluo42 yluo42 closed this as completed May 3, 2019