
Unet decoder upsampling #187

Closed
DamienLopez1 opened this issue Apr 22, 2020 · 12 comments

@DamienLopez1 commented Apr 22, 2020

Hi,

I am using a Unet model with the encoder set to 'resnet34' and pretrained ImageNet weights.

When I look at the model I do not see where the upsampling occurs. The convolutions on the encoder side are visible (although the downsampling seemingly happens after the intended layer, e.g. downsampling from layer 1 to layer 2 only occurs within layer 2), but I do not see where the upsampling takes place on the decoder side.

I also do not see the centre block convolutions occurring.

Could someone please explain where the upsampling occurs?

My model for reference:

resnet34 Unet model.txt

@howard-mahe commented Apr 24, 2020

Hello, in this UNet implementation, the upsampling operator of the decoder is a nearest-neighbor interpolation that uses the PyTorch functional API, see:

x = F.interpolate(x, scale_factor=2, mode="nearest")

Since the functional operators are parameter-free, the upsampling operator in UNet does not appear in model.modules().
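
For illustration, a decoder block in this style might look like the following sketch (a simplification; the class name and layer details are my approximation, not the library's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlockSketch(nn.Module):
    # Upsample, concatenate the skip connection, then convolve.
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels + skip_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x, skip=None):
        # The upsampling happens here through the functional API: it has no
        # parameters, so it never shows up in model.modules().
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        if skip is not None:
            x = torch.cat([x, skip], dim=1)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return x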

By default, the center block is disabled, see unet/model.py.
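
If I remember the source correctly, the flag is set in the Unet model definition along these lines (paraphrased from memory, not an exact quote):

center=True if encoder_name.startswith("vgg") else False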

Regarding the downsampling operators, you cannot tell how they are used by looking at model.modules() when the nn.Module is not an nn.Sequential. Instead you must look at the forward pass in unet/decoder.py (sketched above).

Finally, in this library, a stage is implicitly defined as a group of operators (Conv/BN/ReLU/Pool) that outputs activations of the same spatial resolution. To satisfy this implicit definition, one must place the pooling that moves from one stage's resolution to the next at the beginning of the next stage. Example:

  • the output spatial resolutions of [Conv, Conv, Pool s=2] are e.g. 1/4, 1/4, 1/8
  • the output spatial resolutions of [Pool s=2, Conv, Conv] are e.g. 1/4, 1/4, 1/4 => same spatial resolution over the whole layer

Please note that DenseNets (at least) violate the aforementioned principle.

EDIT: The "implicit" definition of a stage seems rather to be: a group of operators that outputs activations with the same number of channels.
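
A minimal sketch of the two orderings (hypothetical channel counts, not code from the library):

import torch.nn as nn

# Pool last: with an input at 1/4 resolution, the per-layer outputs are
# 1/4, 1/4, 1/8 -- the stage mixes two spatial resolutions.
stage_pool_last = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Pool first: with an input at 1/2 resolution, the per-layer outputs are
# 1/4, 1/4, 1/4 -- the whole stage shares one spatial resolution.
stage_pool_first = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)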

@DamienLopez1

Hi Howard,

For semantic segmentation, should the centre block be activated? I have not activated it and my results seem very good (Dice loss of 0.20 and IoU score of 0.767).

With regards to:

the output spatial resolutions of [Conv, Conv, Pool s=2] are e.g. 1/4, 1/4, 1/8
the output spatial resolutions of [Pool s=2, Conv, Conv] are e.g. 1/4, 1/4, 1/4 => same spatial resolution over the whole layer

How do we modify this spatial resolution?

And finally, regarding the segmentation head: mine uses an identity upsample rather than nn.UpsamplingBilinear2d. How do I set upsampling > 1? Is it even necessary?

@JulienMaille

@howard-mahe Sorry to hijack the thread, but I had a question regarding the downsampling part of the encoder. I'm trying to teach my network to recognize small features that quickly disappear after 2 or 3 downsamplings. In order to improve my IoU, I was wondering whether it would make sense to increase the number of feature channels in the first layers of the encoder. Do you have any literature to recommend on this topic?

@howard-mahe commented Apr 24, 2020

@DamienLopez1
The center block used in the original paper can be seen as part of the encoder. The center block is useless when you use deep backbones as the encoder (which is always the case in this library) because deep backbones have enough convolutions in their last stage (e.g. 3 BasicBlocks in the last stage of resnet34, see encoders/resnet.py#L87-88). So yes, keep the center block disabled.

I am not sure I understood your question about modifying the spatial resolution. All the encoders are already defined so that each stage halves the spatial resolution. The default number of stages is 6 (see encoders/resnet.py#L46-L54 for resnets), with output spatial resolutions [1, 1/2, 1/4, 1/8, 1/16, 1/32]. If you want your encoder to be "less deep", you can control the number of stages using the encoder_depth argument. Otherwise, if you want to maintain the spatial resolution without reducing the depth of the encoder, you can apply a dilation rate (see make_dilated) to the convolutions of the last stages, but only if the encoder supports this feature.
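
For example, reducing the encoder depth could look like this (a sketch using the library's documented arguments; the decoder_channels values are just an illustration):

import segmentation_models_pytorch as smp

# encoder_depth=4 keeps one less downsampling stage than the default of 5;
# decoder_channels must then contain one entry per decoder block.
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    encoder_depth=4,
    decoder_channels=(128, 64, 32, 16),
    classes=1,
    activation="sigmoid",
)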

Regarding the upsampling, what do you mean by "identity" upsampling? Do you mean nearest neighbor upsampling? If you do not upsample the coarse features, you won't be able to fuse features from different stages.


Hi @JulienMaille. Indeed, because of the pooling operators (or convolutions with stride=2), some content might disappear from the feature maps after a few stages. Deep backbones are designed to capture both local and global information, hence the theoretical receptive field of a resnet is generally above 200x200. If you are interested in small features, I imagine you do not want to increase the receptive field of your network. So first, do not use pooling operators, and second, do not make the network too deep. In the literature, Wu et al. applied these principles to VGG to detect manipulations in images:

Wu, Y., AbdAlmageed, W., & Natarajan, P. (2019). ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9543-9552).

Hence the authors maintained the receptive field of their "fully convolutional VGG" at 23x23.

@DamienLopez1

Thanks Howard, that helps a lot.

Regarding the identity upsampling, this is my model's segmentation head:

(segmentation_head): SegmentationHead(
  (0): Conv2d(16, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): Identity()
  (2): Activation(
    (activation): Sigmoid()
  )
)

The upsampling is set to identity. Why would this be?

@howard-mahe commented Apr 24, 2020

In the UNet architecture, the UnetDecoder already upsamples the features back to the input image size, but this might not be the case for other decoders. In the particular case of UNet, you do not need to upsample the features further. That is why the upsampling argument of the SegmentationHead (see base/heads.py) is not set in the definition of UNet (unet/model.py). If you look at the segmentation head implementation, you will find this:

upsampling = nn.UpsamplingBilinear2d(scale_factor=upsampling) if upsampling > 1 else nn.Identity()

The segmentation head of UNet matches the case where upsampling == 1, hence the nn.Identity(), because we don't need to upsample the features any further.
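
Putting it together, the whole head is roughly this (paraphrased from base/heads.py; the Activation wrapper is simplified here to a bare nn.Sigmoid):

import torch.nn as nn

class SegmentationHeadSketch(nn.Sequential):
    def __init__(self, in_channels, out_channels, kernel_size=3, upsampling=1):
        conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        # nn.Identity() when upsampling == 1, which is exactly the case printed above.
        up = nn.UpsamplingBilinear2d(scale_factor=upsampling) if upsampling > 1 else nn.Identity()
        super().__init__(conv, up, nn.Sigmoid())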

@DamienLopez1

Thanks a lot Howard, I get it now!

@howard-mahe commented Apr 24, 2020

@JulienMaille By pooling operators I mean any operator with a stride not equal to 1. There are five of them in any resnet, reducing the spatial resolution to 1/32 at the end of the encoder. For your very specific use case, I imagine you probably don't need to reduce the spatial resolution at all (just like in ManTra-Net) in order to maintain a small receptive field. replace_stride_with_dilation won't help you because its purpose is to preserve the receptive field. Please note that if you don't reduce the spatial resolution in the encoder, you don't need a decoder at all. In addition, I don't think it is a good idea to drastically reduce the number of convolutions in your network, because you need it to be deep enough to build a representation of the data.

Here, for instance, is the implementation of the VGG in ManTra-Net:

from torchvision.models.vgg import cfgs, make_layers

# Remove the max-pooling entries ('M') from the layer configurations
cfgs = {name: list(filter(lambda a: a != 'M', cfg)) for name, cfg in cfgs.items()}

# Configs 'A', 'B', 'D', 'E' correspond to VGG-11/13/16/19 in torchvision
vgg11 = make_layers(cfgs['A'])
vgg13 = make_layers(cfgs['B'])
vgg16 = make_layers(cfgs['D'])
vgg19 = make_layers(cfgs['E'])

If you have a lot of training data (1M+ images), you will be able to train such a network from scratch. If you don't, well... you will have a hard time, because pre-trained deep neural nets are designed with pooling operators which aggregate information over a large receptive field. You cannot reuse those pre-trained weights if you remove the pooling ops, except for the very first stage, which works at full resolution.

DamienLopez1 reopened this Apr 24, 2020
@DamienLopez1 commented Apr 24, 2020

Maybe this should be another issue, but could I get an explanation of why Dice loss works as a loss function for UNet, rather than the softmax with BCE loss described in the UNet paper? I am doing background vs. single-class segmentation and am using a sigmoid activation.

I am thinking that because it's a single class with sigmoid, this is why Dice loss is permissible? Am I on the right track?

@howard-mahe commented Apr 24, 2020

You are right, this should be another issue.

Well, I am not an expert on binary losses. But the point is that UNet was designed in 2015. Since then, practitioners have found that Dice loss can be a good substitute for, or even complementary to, BCE loss for binary segmentation tasks.

Regarding softmax vs. sigmoid: for binary segmentation tasks, you are better off with the sigmoid, predicting one heat map, while softmax must be used for multi-class segmentation tasks. Softmax can also be used for binary segmentation with the number of classes K set to 2, but this is a redundant formulation compared to sigmoid.
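
For reference, one common formulation of the soft Dice loss in the binary/sigmoid case looks like this (a generic sketch, not this library's implementation):

import torch

def soft_dice_loss(probs, targets, eps=1e-7):
    # probs: sigmoid outputs in [0, 1]; targets: binary masks; both (N, 1, H, W).
    dims = (1, 2, 3)
    intersection = (probs * targets).sum(dim=dims)
    cardinality = probs.sum(dim=dims) + targets.sum(dim=dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()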

@DamienLopez1

Thanks a lot Howard, that clears things up for me.

@askerlee commented Aug 23, 2020

Hi @howard-mahe, have you ever tried using bilinear interpolation (or trilinear interpolation for 3D input)? In my experiments, bilinear interpolation always performs better than nearest neighbor and yields smoother masks. (My model is not UNet-based, so I'm not totally sure whether this conclusion generalizes to UNet.)

Hello, in this UNet implementation, the upsampling operator of the decoder is a nearest neighbor interpolation which uses the PyTorch functional api, see:

x = F.interpolate(x, scale_factor=2, mode="nearest")
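
For comparison, the bilinear variant of that call would be something like this (align_corners=False is PyTorch's default for bilinear mode):

x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)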
