# Deep Learning - Convolutional Neural Networks


<!-- TOC START min:2 max:4 link:true asterisk:true update:true -->
* [What you'll learn in this class](#what-youll-learn-in-this-class)
* [Convolutional Neural Networks](#convolutional-neural-networks)
  * [General principle](#general-principle)
  * [Spatial arrangement](#spatial-arrangement)
  * [Local Connectivity](#local-connectivity)
  * [Parameter sharing](#parameter-sharing)
<!-- TOC END -->



## What you'll learn in this class

Convolutional neural networks generalize the densely connected neural networks that we learned about in the last lesson. They have become very popular because they are particularly well suited to image analysis. Today we will see what they are made of and how to use them.



## Convolutional Neural Networks

Convolutional neural networks are a variant of the neural networks we saw earlier. They are particularly good at solving problems related to images or other objects that have a spatial dimension (we will see what this means in detail in what follows). To better explain how these models work, we will use the example of handwritten digit recognition.

### General principle

We have previously seen that the neurons of the input layer take the explanatory variables as input, while the neurons of every following layer take the outputs of the previous layer. What changes with convolutional neural networks is that the input of each neuron is treated a little differently, to take into account the spatial characteristics of the input object. The example below illustrates this idea:

![](https://drive.google.com/uc?export=view&id=1_s27XL6-gMvUdqAZcvBhdHcTvBE4Nsss)


For example, when recognizing handwritten digits in images of 18 by 18 pixels, we have 324 variables, which are the pixels of the image. In a classical neural network, the inputs of the first layer would be the individual pixels, and each pixel would be assigned a specific weight for each neuron of the input layer. A convolutional neural network works differently: we do not assign a specific weight to each element (pixel) of the input. Instead, the weights form a window (called a filter) that travels all over the input to calculate the output. This comes from the idea that when we identify objects by vision, we detect patterns and geometrical properties of the object as a whole, and not point by point, until we reconstruct a complete image. Applying a common filter all over the input, rather than assigning a specific weight to every single point, is intended to communicate to the network the importance of looking for small patterns in order to understand images.

<img src="https://drive.google.com/uc?export=view&id=1l2JTEFItRHbC9QKdEUgJ7pyZXhxAc7ws" alt="conv">

In the figure above, each neuron corresponds to the following function:


![](https://drive.google.com/uc?export=view&id=16VT0OtxmHrQPecpUsw5U9zZqmqdwl_rX)


Where:

* $x_{i,j}$ is the value of the pixel at position $(i, j)$ under the filter (for example, $x_{3,1}$ is the value of the pixel at the bottom left of the filter),
* $w_{i,j}$ is the weight (parameter) of the considered neuron associated with the corresponding pixel,
* $b$ is the bias parameter learned for each neuron.

Sliding the filter across the different areas of the image is what defines the "convolution".
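The sliding operation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `neuron_output` and `convolve`, the toy 4 by 4 image and the all-ones filter are hypothetical, the stride is fixed to 1, no padding is used, and no activation function is applied to the weighted sum.

```python
def neuron_output(patch, weights, bias):
    """Weighted sum of the pixels under the filter, plus the bias.

    No activation function is applied here (an assumption of this sketch).
    """
    total = bias
    for patch_row, weight_row in zip(patch, weights):
        for x, w in zip(patch_row, weight_row):
            total += w * x
    return total


def convolve(image, weights, bias=0.0):
    """Slide a square filter over every position where it fits (stride 1)."""
    k = len(weights)  # the filter is k x k
    height, width = len(image), len(image[0])
    output = []
    for top in range(height - k + 1):
        row = []
        for left in range(width - k + 1):
            # Extract the k x k patch of the image currently under the filter.
            patch = [r[left:left + k] for r in image[top:top + k]]
            row.append(neuron_output(patch, weights, bias))
        output.append(row)
    return output


# Hypothetical toy data: a 4x4 "image" and a 3x3 filter of ones,
# which simply sums the pixels it covers.
image = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
weights = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
feature_map = convolve(image, weights, bias=0.0)
print(feature_map)
```

The same nine weights and the same bias are reused at every position, so the output is a 2 by 2 grid here because a 3 by 3 filter fits in only two positions along each side of a 4 by 4 image.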



### Spatial arrangement

In the case of densely connected neural networks, the architectural hyper-parameters we have to choose are the number of layers and the number of neurons on each layer. Here, other hyper-parameters have to be chosen on top of those, depending on the objects that we want the network to analyze. Taking the example of the handwritten digit images, the hyper-parameters to be determined are:

* **Filters**: several neurons can be used to analyze the same zone of the input object. The number of neurons dedicated to a given zone is called the depth of the convolutional layer.
* **Stride**: we have to choose how the filter will move across the input. In the previous figure, the stride was equal to $1$, because we were shifting the filter by one pixel between two positions. The larger the stride, the smaller the output of the layer.
* **Kernel_size**: we need to define the dimensions of the filters we want to use. In our example, we have chosen dimensions of 3 pixels by 3 pixels. The bigger the kernel_size, the bigger the patterns we are asking our convolutional neurons to detect.
* **Padding**: padding consists in adding pixels around the image to artificially increase its size. This enables the filter to reach positions it could not have reached otherwise, specifically on the borders of the input. In Keras, the `padding` argument of a convolutional layer offers two options: `valid`, which adds no padding, so the filter only visits positions where it fits entirely inside the input, and `same`, which pads the input so that, with a stride of $1$, the output has the same height and width as the input (this is especially useful when building very deep networks). Padding can also be set manually with a separate `ZeroPadding2D` layer. In the example, the output that we get has the same shape as the input (in this case the image of size 18 by 18 pixels).
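To see how stride, kernel size and padding interact, the size of the output along one dimension can be computed with the usual counting formula. The helper below is a minimal sketch: the name `conv_output_size` is hypothetical, and `padding` is assumed to be the number of pixels added on each side of the input.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of filter positions along one dimension of the input.

    `padding` is the number of pixels added on EACH side (an assumption
    of this sketch). Integer division drops positions where the filter
    would no longer fit inside the padded input.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 18-pixel side, 3x3 filter, as in the running example.
no_padding = conv_output_size(18, kernel_size=3, stride=1, padding=0)
with_padding = conv_output_size(18, kernel_size=3, stride=1, padding=1)
larger_stride = conv_output_size(18, kernel_size=3, stride=2, padding=0)
print(no_padding, with_padding, larger_stride)
```

With a stride of 1, no padding gives an output side of 16 and one pixel of padding restores the input side of 18, while a stride of 2 roughly halves the output side.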

We show in the figure below what padding means:


<table>
  <tr>
   <td>

<img src="https://drive.google.com/uc?export=view&id=1GGLejkBqaLedIG4swumHWHuwchuSOkmI" alt="padding-1">

   </td>
   <td>

<img src="https://drive.google.com/uc?export=view&id=1j_BkYZPSa6ZJjoaxoiAyPJVawlZSAlAN" alt="padding-2">

   </td>
  </tr>
</table>


In the figure on the left, without padding, the possible positions for the center of a filter of size $3 \times 3$ are shown in red, and the area covered by all the filter positions is indicated by the dotted lines. The filter therefore produces $16 \times 16$ output elements, which gives an object smaller than the input image. Padding consists in artificially surrounding the image with a layer of pixels, so that we can choose the shape of the output. Here, for example, the chosen padding size is $1$, so the possible positions of the filter's center now cover the whole image and the output is $18 \times 18$.
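The effect of padding on the number of filter positions can be checked directly. The sketch below assumes the added pixels are zeros (the text only says "a layer of pixels"), and the names `zero_pad` and `count_positions` are hypothetical helpers written for this illustration.

```python
def zero_pad(image, pad):
    """Surround a 2D image (list of rows) with `pad` layers of zeros."""
    width = len(image[0])
    blank = [0] * (width + 2 * pad)
    padded = [blank[:] for _ in range(pad)]          # rows added on top
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend(blank[:] for _ in range(pad))      # rows added at the bottom
    return padded


def count_positions(image, kernel_size):
    """Rows and columns of positions where the filter fits entirely (stride 1)."""
    rows = len(image) - kernel_size + 1
    cols = len(image[0]) - kernel_size + 1
    return rows, cols


# An 18x18 image filled with ones, standing in for a handwritten digit.
image = [[1] * 18 for _ in range(18)]
padded = zero_pad(image, pad=1)

print(count_positions(image, 3))   # filter positions without padding
print(count_positions(padded, 3))  # filter positions with one layer of padding
```

Padding by one pixel turns the 18 by 18 image into a 20 by 20 one, which is exactly what a 3 by 3 filter needs to be centered on every original pixel.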



### Local Connectivity

Using filters reduces the number of parameters that the network needs to optimize during its learning process. In a densely connected layer, each neuron has $18 \times 18 = 324$ weights, plus an optional bias, to be optimized, whereas a filter of size $3 \times 3$ reduces the number of weights to $3 \times 3 = 9$, plus an optional bias.
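The arithmetic above can be reproduced with two small helpers. This is only a sketch: the names `dense_neuron_params` and `conv_filter_params` are hypothetical, and the input is assumed to have a single channel (a grayscale image).

```python
def dense_neuron_params(height, width, use_bias=True):
    """Parameters of one densely connected neuron: one weight per pixel."""
    return height * width + (1 if use_bias else 0)


def conv_filter_params(kernel_size, use_bias=True):
    """Parameters of one square filter on a single-channel input."""
    return kernel_size * kernel_size + (1 if use_bias else 0)


dense = dense_neuron_params(18, 18)  # 324 weights + 1 bias
conv = conv_filter_params(3)         # 9 weights + 1 bias
print(dense, conv)
```

The count for the filter does not depend on the size of the image, which is why the saving grows as the input gets larger.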