# Deep learning
http://neuralnetworksanddeeplearning.com/chap6.html

## Introducing convolutional networks
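**Local receptive fields**: In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons. In a convolutional network, it helps to think of the inputs instead as a 28×28 square of neurons, whose values correspond to the 28×28 pixel intensities of the input image: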

![](http://neuralnetworksanddeeplearning.com/images/tikz42.png)

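As usual, we connect the input pixels to a layer of hidden neurons, but we don't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.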

To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say a 5×5 region, corresponding to 25 input pixels. So, for a particular hidden neuron, we might have connections that look like this:

![](http://neuralnetworksanddeeplearning.com/images/tikz43.png)

That region in the input image is called the **local receptive field** for the hidden neuron.

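We then slide the local receptive field across the entire input image. For each position of the local receptive field there is a different hidden neuron in the first hidden layer. With a 28×28 input image and 5×5 local receptive fields, this gives 24×24 neurons in the hidden layer: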

![](http://neuralnetworksanddeeplearning.com/images/tikz44.png)

![](http://neuralnetworksanddeeplearning.com/images/tikz45.png)

Sliding the local receptive field one pixel at a time corresponds to a stride length of 1. In fact, sometimes a different **stride length** is used. For instance, we might move the local receptive field 2 pixels to the right (or down), in which case we'd say a stride length of 2 is used.
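The effect of the stride length on the size of the hidden layer can be checked with a little arithmetic. The sketch below is illustrative only: `feature_map_size` is a hypothetical helper, and it assumes a 28×28 input image and no padding at the image border.

```python
def feature_map_size(input_size, field_size, stride=1):
    """Count the positions a local receptive field can occupy along one axis.

    Assumes no padding: the field must lie entirely inside the image.
    """
    return (input_size - field_size) // stride + 1


# 28x28 input, 5x5 local receptive field, stride 1 -> 24 positions per axis
print(feature_map_size(28, 5))
# Same input and field, stride 2 -> 12 positions per axis
print(feature_map_size(28, 5, stride=2))
```

With a stride of 1 the hidden layer is 24×24; with a stride of 2 it shrinks to 12×12.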

**Shared weights and biases**: I've said that each hidden neuron has a bias and 5×5 weights connected to its local receptive field. What I did not yet mention is that we're going to use the same weights and bias for each of the 24×24 hidden neurons. In other words, for the j,kth hidden neuron, the output is:

$\sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m} a_{j+l, k+m} \right). \ (125)$

Here, σ is the neural activation function - perhaps the sigmoid function we used in earlier chapters. b is the shared value for the bias. $w_{l,m}$ is a 5×5 array of shared weights. And, finally, we use $a_{x,y}$ to denote the input activation at position x,y.
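Equation (125) can be written out directly in NumPy. This is a minimal, unoptimized sketch rather than the code used later in these notes: the function names, the random weights, and the 28×28 input are assumptions made for illustration, and σ is taken to be the sigmoid.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feature_map(a, w, b):
    """Evaluate Equation (125) at every position j, k where the field fits."""
    n = w.shape[0]                      # field size, 5 in the text
    rows = a.shape[0] - n + 1
    cols = a.shape[1] - n + 1
    out = np.empty((rows, cols))
    for j in range(rows):
        for k in range(cols):
            # sum over l, m of w[l, m] * a[j + l, k + m], plus the shared bias
            out[j, k] = sigmoid(b + np.sum(w * a[j:j + n, k:k + n]))
    return out


rng = np.random.default_rng(0)
a0 = rng.random((28, 28))               # stand-in for input activations
w = rng.standard_normal((5, 5))         # shared weights (random, not learned)
b = 0.0                                 # shared bias
a1 = feature_map(a0, w, b)
print(a1.shape)                         # one output per field position
```

The same `w` and `b` are reused at every position, which is exactly what "shared" means here.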

**Convolutional networks are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat.**

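Because every hidden neuron uses the same weights and bias, all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image. For this reason, we sometimes call the map from the input layer to the hidden layer a **feature map**. We call the weights defining the feature map the **shared weights**. And we call the bias defining the feature map in this way the **shared bias**. The shared weights and bias are often said to define a **kernel** or filter.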

To do image recognition we'll need more than one feature map. And so a complete convolutional layer consists of several different feature maps:

![](http://neuralnetworksanddeeplearning.com/images/tikz46.png)

In the example shown, there are 3 feature maps. Each feature map is defined by a set of 5×5 shared weights, and a single shared bias. The result is that the network can detect 3 different kinds of features, with each feature being detectable across the entire image.
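Sharing weights also keeps the parameter count small, which a quick calculation makes concrete. The helper names below are hypothetical; the fully-connected comparison uses the 784-input, 100-neuron layer from the example network in these notes.

```python
def conv_layer_params(num_maps, field_size):
    """Each feature map has field_size**2 shared weights plus one shared bias."""
    return num_maps * (field_size * field_size + 1)


def fully_connected_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output neuron."""
    return n_in * n_out + n_out


print(conv_layer_params(3, 5))           # 3 feature maps:  3 * 26 = 78
print(conv_layer_params(20, 5))          # 20 feature maps: 20 * 26 = 520
print(fully_connected_params(784, 100))  # 784 * 100 + 100 = 78500
```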

Let's take a quick peek at some of the features which are learned:

![](http://neuralnetworksanddeeplearning.com/images/net_full_layer_0.png)

The 20 images correspond to 20 different feature maps (or filters, or kernels). Each map is represented as a 5×5 block image, corresponding to the 5×5 weights in the local receptive field. Whiter blocks mean a smaller (typically, more negative) weight, so the feature map responds less to the corresponding input pixels. Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels.

The operation in Equation (125) is sometimes known as a **convolution**. A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and ∗ is called a convolution operation.

**Pooling layers**: Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.

In detail, a pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) $2 \times 2$ neurons in the previous layer. As a concrete example, one common procedure for pooling is known as **max-pooling**. In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region, as illustrated in the following diagram:

![](http://neuralnetworksanddeeplearning.com/images/tikz47.png)

Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.
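As a rough illustration, max-pooling over non-overlapping 2×2 regions can be sketched with a reshape. `max_pool` is a hypothetical helper, and it assumes the feature map's height and width are divisible by the pooling size.

```python
import numpy as np


def max_pool(feature_map, size=2):
    """Max over non-overlapping size x size regions of a 2D feature map."""
    rows, cols = feature_map.shape
    assert rows % size == 0 and cols % size == 0
    # blocks[i, di, j, dj] == feature_map[size * i + di, size * j + dj]
    blocks = feature_map.reshape(rows // size, size, cols // size, size)
    return blocks.max(axis=(1, 3))


fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))                          # rows [5, 7] and [13, 15]
print(max_pool(np.zeros((24, 24))).shape)    # (12, 12)
```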

We apply max-pooling to each feature map separately. So if there were three feature maps, the combined convolutional and max-pooling layers would look like:

![](http://neuralnetworksanddeeplearning.com/images/tikz48.png)

We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information. The intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features. A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.

**L2 pooling**: Here, instead of taking the maximum activation of a 2×2 region of neurons, we take the square root of the sum of the squares of the activations in the 2×2 region. While the details are different, the intuition is similar to max-pooling: L2 pooling is a way of condensing information from the convolutional layer. 
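L2 pooling differs from max-pooling only in how each region is summarized. A minimal sketch, again with a hypothetical helper name and assuming dimensions divisible by the pooling size:

```python
import numpy as np


def l2_pool(feature_map, size=2):
    """Square root of the sum of squares over each size x size region."""
    rows, cols = feature_map.shape
    assert rows % size == 0 and cols % size == 0
    blocks = feature_map.reshape(rows // size, size, cols // size, size)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))


region = np.array([[3.0, 4.0],
                   [0.0, 0.0]])
print(l2_pool(region))    # sqrt(3^2 + 4^2 + 0 + 0) = 5
```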

**Putting it all together**:

![](http://neuralnetworksanddeeplearning.com/images/tikz49.png)

The final layer of connections in the network is a **fully-connected layer**. That is, this layer connects every neuron from the max-pooled layer to every one of the 10 output neurons.
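To see how the shapes fit together, here is a sketch of one forward pass through such a network in NumPy. Everything in it is an assumption for illustration: the weights are random rather than learned, there are 3 feature maps, the output layer uses a sigmoid, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
image = rng.random((28, 28))                    # stand-in for one input image
conv_w = rng.standard_normal((3, 5, 5))         # 3 feature maps, 5x5 shared weights
conv_b = rng.standard_normal(3)                 # one shared bias per feature map
fc_w = rng.standard_normal((10, 3 * 12 * 12))   # fully-connected weights
fc_b = rng.standard_normal(10)

# Every 5x5 local receptive field of the image: shape (24, 24, 5, 5)
windows = sliding_window_view(image, (5, 5))
# Convolutional layer: shape (3, 24, 24)
conv = sigmoid(np.einsum("jklm,flm->fjk", windows, conv_w)
               + conv_b[:, None, None])
# 2x2 max-pooling applied to each feature map separately: shape (3, 12, 12)
pooled = conv.reshape(3, 12, 2, 12, 2).max(axis=(2, 4))
# Fully-connected layer: every pooled neuron feeds every output neuron
output = sigmoid(fc_w @ pooled.reshape(-1) + fc_b)

print(conv.shape, pooled.shape, output.shape)
```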

## Convolutional neural networks in practice
```py
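import network3
from network3 import Network
from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer

# Load MNIST as training, validation, and test sets.
training_data, validation_data, test_data = network3.load_data_shared()
mini_batch_size = 10

# Baseline without convolution: 784 inputs -> 100 hidden neurons -> 10-way softmax.
net = Network([
        FullyConnectedLayer(n_in=784, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)

# Positional arguments: training data, epochs (60), mini-batch size,
# learning rate (0.1), validation data, test data.
net.SGD(training_data, 60, mini_batch_size, 0.1,
        validation_data, test_data)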
```

**Using rectified linear units**: Instead of sigmoid activations, we use rectified linear units. That is, we'll use the activation function f(z)≡max(0,z).
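In NumPy, this activation is a one-liner. The sketch below is independent of `network3`; the function name `relu` is just a conventional choice.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: f(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)


z = np.array([-2.0, 0.0, 3.5])
print(relu(z))    # negative entries become 0, the rest pass through unchanged
```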

**Expanding the training data**: Another way we may hope to improve our results is by algorithmically expanding the training data.
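One simple way to do this is to displace each training image by a single pixel up, down, left, and right, keeping the original label for each copy. The sketch below assumes that scheme and uses hypothetical helper names; vacated pixels are filled with zeros.

```python
import numpy as np


def shift_image(image, dy, dx):
    """Shift a 2D image by dy rows and dx columns, filling gaps with 0."""
    rows, cols = image.shape
    shifted = np.zeros_like(image)
    src = image[max(0, -dy):rows - max(0, dy), max(0, -dx):cols - max(0, dx)]
    shifted[max(0, dy):rows - max(0, -dy), max(0, dx):cols - max(0, -dx)] = src
    return shifted


def expand(images, labels):
    """Return each image plus its four one-pixel shifts, with matching labels."""
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        for dy, dx in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
            out_images.append(shift_image(image, dy, dx))
            out_labels.append(label)
    return out_images, out_labels


img = np.zeros((3, 3))
img[1, 1] = 1.0
bigger_x, bigger_y = expand([img], [7])
print(len(bigger_x))    # 5: the original plus four shifted copies
```

Applied to a whole training set, this multiplies the number of training images by five.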

**Inserting an extra fully-connected layer**.

**Using an ensemble of networks**.
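One common way to combine an ensemble, assumed here for illustration, is a majority vote over the labels predicted by each network. The function below is a hypothetical sketch that works on arrays of predicted labels and breaks ties in favor of the lowest class index.

```python
import numpy as np


def ensemble_vote(predictions, n_classes=10):
    """Majority vote over predicted labels.

    predictions: integer array of shape (n_networks, n_examples).
    Ties go to the lowest class index.
    """
    predictions = np.asarray(predictions)
    counts = np.stack([
        np.bincount(predictions[:, i], minlength=n_classes)
        for i in range(predictions.shape[1])
    ])                                   # shape (n_examples, n_classes)
    return counts.argmax(axis=1)


# Three networks classify two examples; each row is one network's predictions.
votes = [[7, 2],
         [7, 3],
         [1, 3]]
print(ensemble_vote(votes))              # example 0 -> 7, example 1 -> 3
```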

**Why we only applied dropout to the fully-connected layers**: The convolutional layers have considerable inbuilt resistance to overfitting. The reason is that the shared weights mean that convolutional filters are forced to learn from across the entire image. This makes them less likely to pick up on local idiosyncrasies in the training data.