&copy;Copyright 2017 Shuang Wu<br>
Cited from the book Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com <br>
Learning notes

## CH6 
## Deep Learning

### Introducing convolutional networks
In the networks used so far, adjacent layers are fully connected to one another: every neuron in the network is connected to every neuron in adjacent layers:<br>
![cnn1](/notebooks/imgs/cnn1.jpg)<br>
<strong>Fully-connected networks do not take into account the spatial structure of the images.</strong> Each pixel is one input neuron and all of them are fully connected, so pixels that are far apart and pixels that are close together are treated on exactly the same footing. Convolutional neural networks (CNNs) are designed to exploit that spatial structure.<br>
<strong>Convolutional neural networks</strong><br>
These networks use a special architecture that is particularly well adapted to classifying images. There are 3 basic ideas: <i>local receptive fields, shared weights, pooling</i>.<br>

<strong>Local receptive fields:</strong><br>
Instead of thinking of the inputs as a vertical line of neurons, we think of them as a $28*28$ square of neurons, one per pixel.<br>
![cnn2](/notebooks/imgs/cnn2.jpg)<br>
As usual, we connect the input pixels to a layer of hidden neurons, but we won't connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image. E.g., for a $5*5$ region, a total of 25 input pixels connect to one hidden neuron, as shown below:<br>
![cnn3](/notebooks/imgs/cnn3.jpg)<br>
That region is called the <i>local receptive field</i> of the hidden neuron. Each connection learns a weight, and the hidden neuron also learns an overall bias. We can think of that particular hidden neuron as learning to analyze its particular local receptive field. We then slide the local receptive field across the entire input image, as shown below:<br>
![cnn4](/notebooks/imgs/cnn4.jpg)<br>
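With a $28*28$ image and $5*5$ local receptive fields, the hidden layer will be $24*24$, because the field can only move 23 positions across (or down) before hitting the edge of the image. The field can move right and down by one pixel at a time or by a larger step; the step size is called the <i>stride length</i>. A stride length of 1 is the most commonly used.<br>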
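
The size of the hidden layer follows directly from the image size, the field size, and the stride length. A minimal sketch of that arithmetic (the function name `hidden_layer_size` is made up for these notes, and it assumes the field never hangs over the edge of the image):

```python
def hidden_layer_size(input_size, field_size, stride=1):
    """Number of positions a field_size-wide window can take along one axis."""
    return (input_size - field_size) // stride + 1

print(hidden_layer_size(28, 5))            # 24
print(hidden_layer_size(28, 5, stride=2))  # 12
```

The same count applies independently to rows and columns, which is why the hidden layer is square.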

<strong>Shared weights and biases:</strong><br>
Each hidden neuron has a bias and $5*5$ weights connected to its local receptive field. We use the same weights and bias for each of the $24*24$ hidden neurons. For the $j,k$th hidden neuron, the output is:<br>
$$\sigma(b+\sum^4_{l=0}\sum^4_{m=0}w_{l,m}a_{j+l,k+m})$$<br>
$\sigma$ is the neural activation function, $b$ is the shared value for the bias, $w_{l,m}$ is a $5*5$ array of shared weights, and $a_{x,y}$ is the input activation at position $x,y$. This means that all the neurons in the 1st hidden layer detect exactly the same feature, just at different locations in the input image. This makes convolutional networks well adapted to the translation invariance of images: move a picture of a cat a little way, and it's still an image of a cat.<br>
For this reason, we sometimes call the map from the input layer to the hidden layer a <i>feature map</i>. We call the weights defining the feature map the <i>shared weights</i> and the bias defining the feature map the <i>shared bias</i>. The shared weights and bias are often said to define a <i>kernel</i> or <i>filter</i>.<br>
A single feature map detects just one kind of localized feature. A complete convolutional layer consists of several different feature maps:<br>
![cnn5](/notebooks/imgs/cnn5.jpg)<br>
In the above image, there are 3 feature maps, each defined by a set of $5*5$ shared weights and a single shared bias. So the network can detect 3 different kinds of features, and each feature can be detected anywhere in the image.<br>
In practice, we use many more feature maps than 3. One of the early convolutional networks, LeNet-5, used 6 feature maps. Later we'll use 20 and 40 feature maps. Some examples of learned features:<br>
![cnn6](/notebooks/imgs/cnn6.jpg)<br>
The above 20 images correspond to 20 different feature maps (filters). Each image shows the type of feature the convolutional layer responds to.<br>
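Sharing weights and biases also greatly reduces the number of parameters. Each feature map needs $5*5=25$ shared weights plus a single shared bias, so 20 feature maps need $20*26=520$ parameters. A fully-connected first layer with $784$ input neurons and 30 hidden neurons needs $784*30+30=23,550$ parameters, so the fully-connected layer would have more than 40 times as many parameters as the convolutional layer.<br>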
The name <i>convolutional</i> comes from the fact that the operation in the equation above is sometimes known as a <i>convolution</i>. The equation can be written as:<br>
$$a^1=\sigma(b+w*a^0)$$<br>
$a^1$ is the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is the convolution operation.
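
To make the formula for the $j,k$th hidden neuron concrete, here is a minimal NumPy sketch that evaluates it for every valid position. It is only an illustration, not the code from the book: the function names, the random stand-in image, and the random weights are all assumptions, and it assumes a square image, a square filter, and a stride length of 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a0, w, b):
    """Evaluate sigma(b + sum_{l,m} w[l, m] * a0[j + l, k + m]) for every valid (j, k)."""
    f = w.shape[0]                      # filter size, e.g. 5
    out_size = a0.shape[0] - f + 1      # e.g. 28 - 5 + 1 = 24
    a1 = np.empty((out_size, out_size))
    for j in range(out_size):
        for k in range(out_size):
            # The same weights w and bias b are reused at every position.
            a1[j, k] = sigmoid(b + np.sum(w * a0[j:j + f, k:k + f]))
    return a1

rng = np.random.default_rng(0)
a0 = rng.random((28, 28))       # stand-in for a 28x28 input image
w = rng.normal(size=(5, 5))     # shared weights (untrained)
b = 0.0                         # shared bias
a1 = feature_map(a0, w, b)
print(a1.shape)                 # (24, 24)
```

The explicit double loop is slow but mirrors the indices in the equation one to one; a real library would vectorize this.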

<strong>Pooling layers:</strong><br>
Pooling layers are usually used immediately after convolutional layers. They simplify the information in the output from the convolutional layer: a pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For example, each unit in the pooling layer may summarize a region of $2*2$ neurons in the previous layer.<br>
One common procedure is known as <i>max-pooling</i>: a pooling unit simply outputs the maximum activation in the $2*2$ input region, as below:<br>
![cnn7](/notebooks/imgs/cnn7.jpg)<br>
$24*24$ neurons become $12*12$ neurons after $2*2$ pooling. Max-pooling is applied to each feature map separately, as follows:<br>
![cnn8](/notebooks/imgs/cnn8.jpg)<br>
We can see max-pooling as a way for the network to ask whether a given feature is found anywhere in a region, while throwing away the exact positional information. This helps reduce the number of parameters needed in later layers.<br>

Another common approach is <strong>L2 pooling</strong>: take the square root of the sum of the squares of the activations in the $2*2$ region. Like max-pooling, this is a way of condensing information from the convolutional layer.<br>
If you are really trying to optimize performance, use <strong>validation data</strong> to compare several different approaches to pooling.<br>
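
Both pooling rules fit in a few lines of NumPy. This is a sketch for intuition only (the function name `pool_2x2` and the toy feature map are made up here), and it assumes non-overlapping $2*2$ regions and a feature map with even height and width:

```python
import numpy as np

def pool_2x2(feature_map, mode="max"):
    """Condense a feature map by summarising each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    # After the reshape, axes 1 and 3 index the position inside each 2x2 block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError("mode must be 'max' or 'l2'")

fm = np.array([[1., 2., 0., 0.],
               [3., 4., 0., 5.],
               [0., 0., 1., 1.],
               [0., 2., 1., 1.]])
print(pool_2x2(fm, "max"))   # rows: [4, 5] and [2, 1]
print(pool_2x2(fm, "l2"))    # rows: [sqrt(30), 5] and [2, 2]
```

Either way, a $24*24$ feature map comes out as $12*12$.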

<strong>Putting it all together:</strong><br>
Now put all these ideas together to form a complete convolutional neural network. It is similar to the architecture we were just looking at, but adds a layer of 10 output neurons, corresponding to the 10 possible digit values (0, 1, 2, etc.):<br>
![cnn9](/notebooks/imgs/cnn9.jpg)<br>
The network begins with the $28*28$ input neurons that encode the pixel intensities, followed by a convolutional layer with a $5*5$ local receptive field and 3 feature maps. Next comes a max-pooling layer applied to $2*2$ regions across each of the 3 feature maps. The final layer is fully connected: it connects every neuron from the max-pooled layer to every one of the 10 output neurons. This fully-connected layer is the same kind as in the previous chapters.<br>
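
As a rough end-to-end illustration of this architecture, the sketch below pushes one image through the three stages in NumPy. Everything in it is an assumption made for these notes: the weights are random and untrained, the output neurons use a sigmoid, and the names (`forward`, `conv_w`, `out_w`, ...) are hypothetical. It only shows how the layer shapes fit together.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(image, conv_w, conv_b, out_w, out_b):
    # Convolutional layer: one feature map per 5x5 filter, stride length 1.
    windows = sliding_window_view(image, conv_w.shape[1:])        # (24, 24, 5, 5)
    conv = sigmoid(np.einsum("jklm,flm->fjk", windows, conv_w)
                   + conv_b[:, None, None])                        # (3, 24, 24)
    # Max-pooling layer: 2x2 regions, applied to each feature map separately.
    f, h, w = conv.shape
    pooled = conv.reshape(f, h // 2, 2, w // 2, 2).max(axis=(2, 4))  # (3, 12, 12)
    # Fully-connected layer: every pooled neuron feeds every output neuron.
    output = sigmoid(out_w @ pooled.ravel() + out_b)               # (10,)
    return conv, pooled, output

rng = np.random.default_rng(0)
image = rng.random((28, 28))                 # stand-in for an input digit
conv_w = rng.normal(size=(3, 5, 5))          # 3 feature maps
conv_b = np.zeros(3)
out_w = rng.normal(size=(10, 3 * 12 * 12)) / np.sqrt(3 * 12 * 12)
out_b = np.zeros(10)

conv, pooled, output = forward(image, conv_w, conv_b, out_w, out_b)
print(conv.shape, pooled.shape, output.shape)   # (3, 24, 24) (3, 12, 12) (10,)
```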

The network structure differs from before, but the overall picture is the same: a network made of many simple units, whose behaviors are determined by their weights and biases. The overall goal is still to use training data to train the network's weights and biases so that the network does a good job classifying input digits.<br>
As before, we train with SGD and backpropagation, but the backpropagation procedure needs some modifications, because the earlier derivation was for fully-connected layers and not for convolutional and max-pooling layers.<br>

### Convolutional neural networks in practice
We use the machine learning library known as Theano in Python. This makes it easy to implement backpropagation for convolutional neural networks, since it automatically computes all the mappings involved. Theano can run on either a CPU or a GPU.<br>
We start with a shallow architecture using just a single hidden layer with 100 hidden neurons. We train for 60 epochs, using a learning rate of $\eta=0.1$ and a mini-batch size of 10, without regularization.

In [2]:
# add the folder that contains the code to the module search path
import sys
sys.path.insert(0, 'code')

In [10]:
import network3
from network3 import Network
from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
training_data, validation_data, test_data = network3.load_data_shared()
mini_batch_size = 10
net = Network([
        FullyConnectedLayer(n_in=784, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, 
            validation_data, test_data)

Training mini-batch number 0
Training mini-batch number 1000
Training mini-batch number 2000
Training mini-batch number 3000
Training mini-batch number 4000
Epoch 0: validation accuracy 92.84%
This is the best validation accuracy to date.
The corresponding test accuracy is 92.16%
Training mini-batch number 5000


KeyboardInterrupt: 

Now insert a convolutional layer at the beginning of the network, with a $5*5$ local receptive field, a stride length of 1, and 20 feature maps. We also insert a max-pooling layer with $2*2$ pooling windows. The architecture looks like this:<br>
![cnn10](/notebooks/imgs/cnn10.jpg)<br>
In this architecture, the convolutional and pooling layers can be seen as learning about local spatial structure in the input training image, while the later fully-connected layer learns at a more abstract level, integrating global information from across the entire image.

In [11]:
net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=20*12*12, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, 
            validation_data, test_data)



Training mini-batch number 0
Training mini-batch number 1000


KeyboardInterrupt: 

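Next, we can try inserting a second convolutional-pooling layer between the existing convolutional-pooling layer and the fully-connected hidden layer, again with a $5*5$ local receptive field and pooling over $2*2$ regions.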
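
The shapes needed when stacking two convolutional-pooling layers, such as a $12*12$ input to the second layer and `40*4*4` inputs to the fully-connected layer, can be checked with a small helper. This is just bookkeeping written for these notes (the name `conv_pool_shape` is hypothetical), assuming a stride length of 1 and non-overlapping pooling:

```python
def conv_pool_shape(n_maps_out, size, field=5, pool=2):
    """Return (feature maps, height/width) after one conv-pool layer with stride 1."""
    conv_size = size - field + 1        # sliding a field x field window
    return n_maps_out, conv_size // pool

maps, size = conv_pool_shape(20, 28)    # first layer: 28 -> 24 -> 12
print(maps, size)                       # 20 12
maps, size = conv_pool_shape(40, size)  # second layer: 12 -> 8 -> 4
print(maps, size)                       # 40 4
print(maps * size * size)               # 640 inputs to the fully-connected layer
```

Each of the 40 filters in the second layer looks at all 20 feature maps coming out of the first layer, which is why its filter shape carries a 20 in the second position.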

In [None]:
net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2)),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), 
                      filter_shape=(40, 20, 5, 5), 
                      poolsize=(2, 2)),
        FullyConnectedLayer(n_in=40*4*4, n_out=100),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, 
            validation_data, test_data)  

<strong>Using rectified linear units:</strong><br>
First, change the neurons so that instead of using a sigmoid activation function, we use rectified linear units, i.e. the activation function $f(z)\equiv \max(0, z)$. We also train with learning rate $\eta=0.03$ and add some L2 regularization, with regularization parameter $\lambda=0.1$.
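
For reference, the two activation functions side by side in plain NumPy. This is a standalone sketch, not the Theano definitions used by `network3`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: f(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # squashed into (0, 1), with 0.5 at z = 0
```

Unlike the sigmoid, the rectified linear unit does not saturate for large positive $z$.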

In [None]:
from network3 import ReLU
net = Network([
        ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), 
                      filter_shape=(20, 1, 5, 5), 
                      poolsize=(2, 2), 
                      activation_fn=ReLU),
        ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), 
                      filter_shape=(40, 20, 5, 5), 
                      poolsize=(2, 2), 
                      activation_fn=ReLU),
        FullyConnectedLayer(n_in=40*4*4, n_out=100, activation_fn=ReLU),
        SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.03, 
            validation_data, test_data, lmbda=0.1)

This gives a classification accuracy of 99.23%. In Nielsen's experiments, networks based on rectified linear units (ReLU) consistently outperformed networks based on sigmoid activation functions.

<strong>Expanding the training data:</strong><br>
Expand the training data algorithmically by displacing (translating), rotating, or skewing the training images.<br>
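
A minimal sketch of the simplest of these, displacing an image by one pixel in each direction. The helper names `shift_image` and `expand` are made up for these notes, the toy image is a stand-in for a digit, and pixels exposed by the shift are assumed to be filled with zeros:

```python
import numpy as np

def shift_image(image, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, filling exposed pixels with zeros."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return shifted

def expand(image):
    """Return the original image plus copies shifted one pixel up, down, left, right."""
    return [image] + [shift_image(image, dy, dx)
                      for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]]

image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0       # a toy "digit"
expanded = expand(image)
print(len(expanded))            # 5
```

Applied to every training image, this turns each example into five, all sharing the original label.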

<strong>Inserting an extra fully-connected layer:</strong><br>
Expand the size of the fully-connected layer.<br>
Add an extra fully-connected layer. On its own this doesn't help much in this example, so we can also apply dropout to the fully-connected layers.<br>
With dropout, the number of training epochs was reduced to 40: dropout reduced overfitting, and so we learned faster.<br>
The fully-connected hidden layers have 1,000 neurons, not the 100 used earlier.<br>
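
The idea behind dropout can be sketched in NumPy as below. This shows the general technique under simple assumptions (independent dropping with probability `p_dropout` during training, scaling by the keep probability at test time); it is not a copy of the Theano code, and the function names are hypothetical:

```python
import numpy as np

def dropout_train(activations, p_dropout, rng):
    """Training time: zero each activation independently with probability p_dropout."""
    mask = rng.random(activations.shape) >= p_dropout
    return activations * mask

def dropout_test(activations, p_dropout):
    """Test time: keep every neuron but scale by the keep probability."""
    return activations * (1.0 - p_dropout)

rng = np.random.default_rng(0)
a = np.ones(1000)
dropped = dropout_train(a, 0.5, rng)
print(dropped.mean())              # roughly 0.5: about half the neurons are dropped
print(dropout_test(a, 0.5)[0])     # 0.5
```

The test-time scaling keeps the expected input to the next layer the same as during training.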

<strong>Using an ensemble of networks:</strong><br>
Create several neural networks, and then get them to vote to determine the best classification. This helps, but not by much.
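
One simple way to implement the vote is a per-example majority over the predicted labels. The sketch below assumes each network has already produced an integer class label for each example; the function name `majority_vote` and the toy predictions are made up for illustration:

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_networks, n_examples) integer labels; returns the winning label per example."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # counts[c, i] = number of networks that predicted class c for example i
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)    # ties go to the smaller label

votes = [[3, 8, 1],     # network 1
         [3, 9, 1],     # network 2
         [5, 9, 7]]     # network 3
print(majority_vote(votes))   # [3 9 1]
```

The vote only helps when the networks make different mistakes, so each one should be trained from a different random initialization.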

<strong>Why we only applied dropout to the fully-connected layers:</strong><br>
Dropout is not needed for the convolutional layers, which have considerable inbuilt resistance to overfitting. The shared weights mean that convolutional filters are forced to learn from across the entire image. This makes them less likely to pick up on local idiosyncrasies in the training data, so there is less need to apply other regularizers, such as dropout.

<strong>Going further:</strong><br>
