# Convolutional neural networks

## Why?

- Fully connected networks treat the elements of the input vector as independent features
- Images are __clearly__ dependent; one can estimate pixel values from the ones surrounding it (or vice versa)
- Convolution takes this interaction into account (the relative position of features is used)
- Convolution models spatial relationships between features (not only images, but also soundwaves, 3D models, videos, words in sentences etc.)
- A smaller number of parameters allows us to build deeper and more complicated networks
- Simpler architecture (fewer parameters), tailored to spatially structured data
- __Convolutional neural networks find higher level features (a.k.a. representations) useful for final classification layers__

## What is convolution?

> A kernel (vector/matrix/cube) "slides" over the input data; at each position the overlapping input values are multiplied by the kernel values and the products are summed together

For binary case:

![image](images/convolution_animation.gif)

- The kernel is applied to the first image patch. Its values are multiplied by the respective values in the image
- After multiplication all of those values are summed, __returning a single element__ (you can think of it as a new pixel).
- This process is repeated until the end of the image is reached and a new matrix is created.
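
The steps above can be sketched in plain Python. This is a minimal illustration rather than how any library implements it: the function name `conv2d` and the binary `image` and `kernel` values are made up for the example (they are not necessarily the ones in the animation), and it assumes no padding and a stride of 1. Note that, as in deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the result."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the current patch element-wise by the kernel and sum.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kernel_h)
                for b in range(kernel_w)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical binary image and kernel, chosen only for illustration.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
result = conv2d(image, kernel)
print(result)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A `5x5` input and a `3x3` kernel give a `3x3` output, because the kernel fits in three positions along each dimension.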

Below are the adjustable arguments of convolution (its __hyperparameters__):

# Convolution parameters

## Kernel size

> Size of the kernel. In the case above it was a `3x3` kernel (a pretty popular choice).

- It can be specified as a tuple, e.g. `(3, 3)`
- It can be irregular (e.g. `(3, 2)`), __though it is rarely the case__ (if it is, it is almost always `(N, 1)` or `(1, N)`)
- The larger the kernel, the larger its __receptive field__, but more computations have to be performed

## Stride

> Number of pixels we shift our kernel in a certain direction. In case above it was `(1, 1)`

- It can be specified as a tuple, e.g. `(1, 1)`
- It can be irregular (e.g. `(2, 1)`), __though it is almost never the case__ (specific use cases, __we shouldn't be concerned with this possibility__)
- The larger the stride, the more features from the original image we miss
- The larger the stride, the smaller the output image becomes
- The larger the stride, the fewer operations have to be performed
- __Due to the above, `stride=1` is the most common value__


## Padding

> Additional values around the image (usually zeros). In the case above __there was no padding (a.k.a. "valid" padding)__

![image](images/CNN_diagram.JPG)

- It can be specified as a tuple, e.g. `(1, 1)`
- One can choose from a few modes, specifically (see `padding_mode` in [`torch.nn.Conv2d`](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)):
    - `zeros` - input is padded with zero values; __most common & default__
    - `reflect` - input is mirrored at the border (the border value itself is not repeated); __used for medical imaging, sometimes image segmentation etc.__
    - `replicate` - last value at the border is repeated; __rare__
    - `circular` - input wraps around, so values from the opposite edge are used; __rare__
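
To see what each mode does to the border, here is a small sketch on a one-dimensional signal using `torch.nn.functional.pad`. The input values and the padding width of `2` are arbitrary choices for illustration; note that `F.pad` calls the zero mode `"constant"`, while `Conv2d` names it `zeros` in `padding_mode`.

```python
import torch
import torch.nn.functional as F

# Shape (batch, channels, length); the values are arbitrary.
x = torch.tensor([[[1.0, 2.0, 3.0, 4.0]]])

# Pad two values on the left and two on the right with each mode.
padded = {
    "zeros": F.pad(x, (2, 2), mode="constant", value=0.0),
    "reflect": F.pad(x, (2, 2), mode="reflect"),
    "replicate": F.pad(x, (2, 2), mode="replicate"),
    "circular": F.pad(x, (2, 2), mode="circular"),
}
for name, tensor in padded.items():
    print(f"{name:>9}: {tensor.flatten().tolist()}")
```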

### Okay but why?

> Without padding image shrinks (when `kernel_size` > 1)

Assume we stack a few convolutions, one after another. __After a while our image will be a single pixel!__

- If we add padding we can preserve image's size without introducing noise (or at least too much of it)
- Pixels on the edge do not contribute as much to the kernel. If we add appropriate padding they contribute the same as the rest.
- Acts as a mild regularizer (depending on the mode)

### Tips

- With default settings (`stride=1`, `dilation=1`) and an odd `kernel_size`, the `padding` that preserves the input size is easy to compute:

$$
\lfloor\frac{\text{kernel_size}}{2}\rfloor
$$

- The exact output size as a function of these parameters is given in the [Shape](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) section of the PyTorch documentation
- Some libraries provide padding `"same"` which calculates padding for you dynamically ([torchlayers](https://github.com/szymonmaszke/torchlayers) for PyTorch; Keras & Tensorflow have it out of the box). Recent PyTorch versions also accept `padding="same"`, but only with `stride=1`
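
The output-size formula from the PyTorch `Shape` section can be written as a small helper. The function name `conv_output_size` is made up for this sketch, and it handles one spatial dimension at a time. It illustrates both the shrinking problem and the `kernel_size // 2` tip.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Stacking 3x3 convolutions without padding shrinks a 7-pixel side to 1 pixel.
size = 7
sizes_without_padding = [size]
for _ in range(3):
    size = conv_output_size(size, kernel_size=3)
    sizes_without_padding.append(size)
print(sizes_without_padding)  # [7, 5, 3, 1]

# With padding = kernel_size // 2 (odd kernel, stride 1) the size is preserved.
print(conv_output_size(7, kernel_size=3, padding=3 // 2))  # 7
print(conv_output_size(32, kernel_size=5, padding=5 // 2))  # 32

# A larger stride shrinks the output.
print(conv_output_size(32, kernel_size=3, stride=2, padding=1))  # 16
```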

## Dilation

> Dilation means spacing between kernel elements

![dilated](images/dilated.jpg)

- Increased receptive field
- Some pixels are missed but this information can be approximated
- __Also called `atrous convolution`__

### Tips

- Useful for very large images
- Useful when we want to get more global representations of images
- Most useful in the early layers
- Can be concatenated with standard convolution
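
The growth of the receptive field can be quantified with a short sketch. The helper name `effective_kernel_size` is made up here; the expression is the same `dilation * (kernel_size - 1) + 1` term that appears in the PyTorch output-size formula.

```python
def effective_kernel_size(kernel_size, dilation=1):
    """Span covered by a dilated kernel along one dimension."""
    return dilation * (kernel_size - 1) + 1


spans = {}
for dilation in (1, 2, 4):
    spans[dilation] = effective_kernel_size(3, dilation)
    print(f"kernel_size=3, dilation={dilation} -> covers {spans[dilation]} pixels")
```

The number of weights stays at `3` per dimension in every case; only the spacing between them changes.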

# Convolutional layer

## History

> Originally, people created those kernels "by hand"

For example __Sobel filter__ is used to find edges in the image (read more [here](https://en.wikipedia.org/wiki/Sobel_operator))

- Those were very specific (edge detection, face detection, image gradient detection)
- Hard to come up with
- Non-specific to images

In 1989 [Yann LeCun](http://yann.lecun.com/ex/research/index.html) came up with the idea of making convolution a neural network layer to solve the above shortcomings.

## Modern era

> Convolutional filter in Deep Learning has learnable connections instead of hard-fixed values

This approach solves all of the problems outlined in `History`, but we need some more nomenclature to fully understand this idea:

## Input channels

> Number of channels entering convolutional layer

![image](images/CNN_RGB.JPG)

- __Each input channel has its own set of kernels (which are now weights)!__
- Given the above, we already have weights of shape `(in_channels, width, height)` (`width` and `height` being those of the kernel)

__Usually, in the first convolutional layer, `in_channels=3` (Red, Green, Blue) or `1` (grayscale images)__

## Output channels

> Number of channels created by the convolution operation (independent of the number of input channels)

- Each output channel (with its filters) convolves over __all input channels__ and sums the results
- Given the above, we have weights of shape `(out_channels, in_channels, width, height)`

__Finally data is produced of shape `(batch, out_channels, width, height)`__
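
These shapes can be checked directly on a `torch.nn.Conv2d` layer. The channel counts, batch size and image size below are arbitrary values picked for the sketch. Note that PyTorch orders the spatial dimensions as `(height, width)`.

```python
import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
images = torch.randn(4, 3, 32, 32)  # (batch, in_channels, height, width)

outputs = conv(images)
print(conv.weight.shape)  # torch.Size([8, 3, 3, 3])
print(conv.bias.shape)    # torch.Size([8])
print(outputs.shape)      # torch.Size([4, 8, 32, 32])
```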

## Filters

> Collection of kernels (sometimes also called channels)

In the case of `Conv2d` there are `in_channels * out_channels` kernels in total

# Benefits of convolution

- Much smaller number of parameters (when compared to `nn.Linear`)
- Same parameters go over regions of an image (with `nn.Linear` each parameter would be responsible for one pixel)
- Tailored for this specific task (architecture)
- __Much__ higher performance on spatial tasks
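
As a rough illustration of the parameter savings, the sketch below compares a convolution with a fully connected layer that maps between tensors of the same sizes. The helper `count_parameters` and the tiny `8x8` setting are assumptions made for this example; real images make the gap far larger.

```python
import torch


def count_parameters(module):
    """Total number of trainable values (weights and biases) in `module`."""
    return sum(p.numel() for p in module.parameters())


# Hypothetical setting: 1-channel 8x8 image mapped to 4 channels of the same size.
conv = torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
linear = torch.nn.Linear(in_features=1 * 8 * 8, out_features=4 * 8 * 8)

print(count_parameters(conv))    # 4 * 1 * 3 * 3 + 4 = 40
print(count_parameters(linear))  # 64 * 256 + 256 = 16640
```

The convolution's count does not depend on the image size at all, whereas the linear layer's count grows with the product of input and output sizes.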

### Exercise

__Use `torch.nn` package for layer creation__

__Layers should be placed in a generator/list and iterated over at the end!__

- Create `Conv1d`, `Conv2d` and `Conv3d` with given `in_channels`, `out_channels` and `kernel_size`.
- Create `nn.Linear` layers with `in_features` and `out_features` taken from the elements of `flattened_input_images_sizes` and `flattened_output_images_sizes` respectively (__tip:__ use `zip` function and list comprehension inside the top level list)
- Iterate over created layers:
    - `print` their name (each layer has its textual representation defined with `__repr__`)
    - `print` the shape of their weights
    - `print` total number of parameters
    
__Analyze the results in groups and come up with conclusions__

```python
import torch

in_channels = 32
out_channels = 32
kernel_size = 5

flattened_input_images_sizes = [28 * 28, 32 * 32]
flattened_output_images_sizes = [28 * 28, 32 * 32]

# Your code here
...
```

```
Conv1d(32, 32, kernel_size=(5,), stride=(1,))
torch.Size([32, 32, 5])
5152
Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
torch.Size([32, 32, 5, 5])
25632
Conv3d(32, 32, kernel_size=(5, 5, 5), stride=(1, 1, 1))
torch.Size([32, 32, 5, 5, 5])
128032
Linear(in_features=784, out_features=784, bias=True)
torch.Size([784, 784])
615440
Linear(in_features=1024, out_features=1024, bias=True)
torch.Size([1024, 1024])
1049600
```


## What does each filter look for?

> Here's what some simple, small, 1 channel filters might look like after they've been trained.

![](images/filters.png)

> Convolutional neural networks are often represented by diagrams like the one below:

![](images/cnn.png)

# Pooling

> Pooling allows us to control when we want to shrink image's width and height

![](images/maxpoolfig.gif)

## Why?

- As the network gets deeper __we increase the number of channels__ in order to learn more abstract representations, hence __computational cost increases quickly__
- Pooling allows us to control computational cost of operations
- Pooling chooses the most important features from the image

## Versions

There are a couple of versions of pooling; the most prevalent are:
- `MaxPooling`
- `AvgPooling` (taking average of kernel values)

### MaxPooling

- Chooses most important features
- Sharper decisions
- __Might__ be easier to train but __might__ be worse on validation
- __Most popular__
- Suitable for large networks with enough capacity (layers) to find most important features

### AvgPooling

- Takes all features into account
- Smoother decision boundary
- __Might__ be harder to train but __might__ be better on validation
- __Less popular__
- Suitable for smaller networks with lower capacity as it doesn't leave any feature behind
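
The difference between the two versions is easy to see on a small tensor. The `4x4` input values and the `2x2` window below are arbitrary choices made for this sketch.

```python
import torch

# (batch, channels, height, width); values chosen only for illustration.
x = torch.tensor([[[[1.0, 2.0, 5.0, 6.0],
                    [3.0, 4.0, 7.0, 8.0],
                    [9.0, 10.0, 13.0, 14.0],
                    [11.0, 12.0, 15.0, 16.0]]]])

max_pool = torch.nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size
avg_pool = torch.nn.AvgPool2d(kernel_size=2)

max_pooled = max_pool(x)  # each 2x2 window is replaced by its maximum
avg_pooled = avg_pool(x)  # each 2x2 window is replaced by its mean
print(max_pooled)  # values 4, 8, 12, 16 in a 2x2 grid
print(avg_pooled)  # values 2.5, 6.5, 10.5, 14.5 in a 2x2 grid
```

Both layers halve the width and height here; they differ only in how each window is summarized.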

## Tips

- At the start of a neural network __do not use pooling__. Go for a couple of layers/blocks (e.g. 3 or 4) and pool after that
- After the initial convolutional layers you may pool every 2 layers/blocks (though you might go for more)
- Provided as [torch.nn.MaxPool2d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) and [torch.nn.AvgPool2d](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html)


# Global Pooling

> At the end of a neural network we often need to output `(batch_size, classes)` (for classification)

Therefore we need to get there from shape `(batch_size, channels, width, height)`. Note that:
- Abstract features are gathered in channels, __not in width and height__

We can use so-called `GlobalPooling`, provided in PyTorch as [`AdaptiveMaxPool2d`](https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveMaxPool2d.html) (specify `1` to get `GlobalPooling`):

> Global pooling works like normal pooling, but __always returns an image with a single pixel__, hence the output shape will be `(batch_size, channels, 1, 1)` (__remember to `squeeze` dimensions before passing them to `torch.nn.Linear`!__)

__Global pooling also comes in a few flavors, including `max` and `avg`__
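
A minimal sketch of the shape bookkeeping, with arbitrary sizes (`64` channels, `10` classes) chosen for the example. It uses the `avg` flavor and `flatten(start_dim=1)` instead of `squeeze`; that is a design choice of this sketch, made because a bare `squeeze()` would also drop the batch dimension when `batch_size` is `1`.

```python
import torch

features = torch.randn(4, 64, 7, 7)  # (batch_size, channels, height, width)

global_pool = torch.nn.AdaptiveAvgPool2d(1)  # AdaptiveMaxPool2d(1) for the max flavor
classifier = torch.nn.Linear(64, 10)

pooled = global_pool(features)            # (4, 64, 1, 1)
flattened = pooled.flatten(start_dim=1)   # (4, 64); the batch dimension is kept
logits = classifier(flattened)            # (4, 10)
print(pooled.shape, flattened.shape, logits.shape)
```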

# Exercise

__Create convolutional neural network on your own!__

See comments below for characteristics:

```python
# Create your neural network here
# in_channels=3, classes=1000
# Go for 10 layers, pooling after layers 4, 6, 8
# Global pooling after the 10th layer and end with `nn.Linear(channels, classes)`
# Use nn.Module
```

# ResNets

__One of the most important papers in deep learning [link](https://arxiv.org/abs/1512.03385)__

> Instead of learning feature transformations as we add convolutional layers, __we learn corrections__ to the previous layers

Let's assume `F(x)` is a block containing two convolutional layers transforming inputs `x`. In the ResNet case we would add a __skip connection__:

![resnet_equation](images/resnet_equation.png)
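
Written out, and assuming the figure shows the standard residual formulation from the linked paper (with $y$ used here as an illustrative name for the block's output):

$$
y = F(x) + x
$$

so $F$ only has to learn the residual $y - x$. If the best thing a block can do is pass its input through unchanged, it is enough for $F(x)$ to go towards zero.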

## Degradation problem

> When the network is too deep (on the order of `100` convolutional layers) __it is too hard to optimize__

Due to that:
- __Deeper networks have larger train loss than shallower ones!__ `50` layers would perform better than `1000` (which is not intuitive)

> __ResNets' main contribution is solving the degradation problem__; thanks to that we can easily train `1000`-layer networks (though that depth is __almost never__ needed)

> Anywhere from `18` to `152` layers is usually enough for most tasks (use the smallest one satisfying your needs)


## Additional benefits

- Further (after `BatchNorm`) loss landscape smoothing
- Further reduction in vanishing/exploding gradient (though mostly taken care of by `BatchNorm` already)

## Usage tips

- Use any block and connect it using skip connection
- Usability is not limited to `convolution`, same thing applies for `linear` (though rarely for recurrent neural networks)
- __Try to keep the same size of `inputs` and `outputs` in order not to use projection__, e.g. stack a few residual blocks with the same number of channels and only then double the number of channels
- __DO NOT USE ACTIVATION AT THE END OF LAST LAYER INSIDE RESNET BLOCK__
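
The tips above can be illustrated without wrapping anything in a class. In this sketch the layer sizes are arbitrary, and the use of a `1x1` convolution as the projection is one common option, assumed here for illustration.

```python
import torch

# Block F(x): two convolutions, with no activation after the last layer.
block = torch.nn.Sequential(
    torch.nn.Conv2d(16, 32, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(32),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(32),
)
# The block changes 16 -> 32 channels, so the input needs a projection
# (here a 1x1 convolution) before the two tensors can be added.
projection = torch.nn.Conv2d(16, 32, kernel_size=1)

x = torch.randn(2, 16, 8, 8)
out = block(x) + projection(x)  # skip connection with projection
print(out.shape)  # torch.Size([2, 32, 8, 8])
```

If the block kept the number of channels unchanged, `block(x) + x` would work directly and no projection would be needed.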

## Exercise

__Create `Residual` block!__

- `__init__`:
    - Take two arguments, one `module`, another one `projection` with default value `None`, both being `torch.nn.Module` instances!
- `forward`:
    - Pass `inputs` through `self.module` creating `outputs` variable
    - If `self.projection` is not `None`, modify `inputs` by passing them through `self.projection`
    - Finally add `outputs` and `inputs` (with optional projection) together
    
Create a few residual blocks (with any convolutions inside it), use projection if needed.

```python
class Residual(torch.nn.Module):
    ...
```

## Summary

- Convolutional layers are used with spatially dependent data (usually images)
- Can be used for any task just like `nn.Linear` layers
- Different versions exist (`1D`, `2D`, `3D`) for different tasks, but usually, respectively:
    - Textual (encoded) data or audio
    - Images
    - Videos (rarest case due to large amount of data)
- __Convolution can work on images of any size__ (only `channels` dimension has to be the same)
- Convolution has trainable kernels of specified size which together form filters
- We increase the number of channels while simultaneously reducing the image's size in order to learn more abstract features
- Neural network usually ends with global pooling (and optionally some task specific Linear layers, though modern architectures tend to go for single `nn.Linear`)
- Skip connections are used to combat degradation problem
- Skip connections allow us to use much deeper networks and optimize them more easily

## Challenges

- What are pretrained models and how to use them properly?
- What is depthwise convolution?
- What is pointwise convolution?
- If you know above, `Separable` convolution is depthwise followed by pointwise convolution
- How does MobileNetV2 work and why is it useful?
- What is Squeeze-Excitation block, what are the upsides and downsides of adding it? Where should we add it? Read [Squeeze-Excitation research paper](https://arxiv.org/abs/1709.01507)
- Read [EfficientNet research paper](https://arxiv.org/abs/1905.11946) to learn about an architecture that was SOTA on ImageNet when it was published
- Some additional concepts that one might want to read about: shuffle nets, inception blocks
- How does attention on images work? Check attention on your own or __come back after attention classes__