# Using neural nets to recognize handwritten digits
http://neuralnetworksanddeeplearning.com/chap1.html

## Perceptrons
![](http://neuralnetworksanddeeplearning.com/images/tikz0.png)

$
\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
      \end{array} \right.
\end{eqnarray} \ (1)
$

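Writing the weighted sum as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, and replacing the threshold with a bias, $b \equiv -\mbox{threshold}$, the perceptron rule can be rewritten: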

$
\begin{eqnarray}
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\end{eqnarray} \ (2)
$
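
As a minimal sketch of rule (2), the function below computes a perceptron's output in plain Python. The function name and the choice of weights are illustrative assumptions; weights of -2, -2 with a bias of 3 happen to make the perceptron behave like a NAND gate.

```python
def perceptron_output(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0 (Equation 2)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0

# Illustrative parameters (an assumption, not taken from the notes above).
nand_weights, nand_bias = [-2, -2], 3

# Evaluate the perceptron on every pair of binary inputs.
nand_table = {
    (x1, x2): perceptron_output(nand_weights, [x1, x2], nand_bias)
    for x1 in (0, 1)
    for x2 in (0, 1)
}
print(nand_table)
```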

## Sigmoid neurons
![](http://neuralnetworksanddeeplearning.com/images/tikz8.png)

Just like a perceptron, the sigmoid neuron has a weight for each input, $w_1, w_2, \ldots$, and an overall bias, b. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x+b)$, where σ is called the sigmoid function, and is defined by:

$
\begin{eqnarray} 
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\end{eqnarray} \ (3)
$

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias b is

$
\begin{eqnarray} 
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\end{eqnarray} \ (4)
$
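
Equations (3) and (4) translate almost directly into code. The sketch below uses only the standard `math` module; the function names are chosen here for illustration.

```python
import math

def sigmoid(z):
    """The sigmoid function of Equation (3).

    Note: math.exp(-z) overflows for very large negative z (below about -709),
    so a production version would need to guard against that.
    """
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron_output(weights, inputs, bias):
    """Output of a sigmoid neuron, as in Equation (4)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Unlike a perceptron, the output varies smoothly between 0 and 1.
print(sigmoid(0.0))
print(sigmoid_neuron_output([-2.0, -2.0], [1.0, 1.0], 3.0))
```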

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-5/78296545.jpg)

This shape is a smoothed out version of a step function:

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-5/64005260.jpg)

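The smoothness of σ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias produce a small change Δoutput in the output of the neuron. In fact, calculus tells us that Δoutput is well approximated by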

$
\begin{eqnarray} 
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\end{eqnarray} \ (5)
$
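
Approximation (5) can be checked numerically. The sketch below relies on the standard identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$, which gives $\partial\,\mbox{output}/\partial w_j = \sigma'(z)\,x_j$ and $\partial\,\mbox{output}/\partial b = \sigma'(z)$; the weights, inputs, and perturbations are arbitrary values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Arbitrary illustrative values (assumptions).
weights, inputs, bias = [0.5, -0.3], [1.0, 2.0], 0.1
delta_w, delta_b = [0.01, -0.02], 0.005

# Partial derivatives of the output, using sigma'(z) = sigma(z) * (1 - sigma(z)).
z = sum(w * x for w, x in zip(weights, inputs)) + bias
slope = sigmoid(z) * (1.0 - sigmoid(z))

# Right-hand side of Equation (5): the linear prediction of the change.
predicted = sum(slope * x * dw for x, dw in zip(inputs, delta_w)) + slope * delta_b

# Actual change in output after perturbing the weights and the bias.
new_weights = [w + dw for w, dw in zip(weights, delta_w)]
actual = neuron(new_weights, inputs, bias + delta_b) - neuron(weights, inputs, bias)

print(predicted, actual)
```

For perturbations this small the two numbers should agree closely; the gap grows as the perturbations get larger.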

## The architecture of neural networks
![](http://neuralnetworksanddeeplearning.com/images/tikz11.png)

## A simple network to classify handwritten digits
To recognize individual digits we will use a three-layer neural network:

![](http://neuralnetworksanddeeplearning.com/images/tikz12.png)

Our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains 784=28×28 neurons.
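
To feed an image to the 784 input neurons, the 28 by 28 grid of pixels has to be unrolled into a single vector. A minimal sketch, assuming the image is a list of rows and that pixels are read row by row (the ordering is an assumption made for illustration):

```python
def flatten_image(image):
    """Unroll a 2-D grid of pixel intensities into one flat list, row by row."""
    return [pixel for row in image for pixel in row]

# A blank 28 x 28 image with a single bright pixel at row 3, column 5.
image = [[0.0] * 28 for _ in range(28)]
image[3][5] = 1.0

x = flatten_image(image)
print(len(x))             # number of input neurons needed
print(x.index(1.0))       # position of the bright pixel in the flat vector
```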

The hidden layer contains n neurons, and we'll experiment with different values for n. The example shown illustrates a small hidden layer, containing just n=15 neurons.
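
To get a feel for the size of this network, the sketch below counts its parameters. It assumes a fully connected network with 10 output neurons (one per digit, matching the layer sizes passed to `network.Network` later in these notes), one weight per connection, and one bias per non-input neuron.

```python
# Layer sizes: input, hidden (n = 15), output.
sizes = [784, 15, 10]

# One weight matrix per pair of adjacent layers, shaped (neurons_out, neurons_in).
weight_shapes = [(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

num_weights = sum(rows * cols for rows, cols in weight_shapes)
num_biases = sum(sizes[1:])  # input neurons have no bias

print(weight_shapes)
print(num_weights, num_biases, num_weights + num_biases)
```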

## Learning with gradient descent
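The [MNIST][1] data comes in two parts. The first part contains 60,000 images to be used as training data. The second part of the MNIST data set is 10,000 images to be used as test data. The loader used in the implementation section splits the 60,000 images further, into 50,000 training images and 10,000 validation images.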

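To quantify how well the network's output approximates the desired output, we define the quadratic cost function below. Here w and b denote all the weights and biases in the network, n is the total number of training inputs, a is the vector of outputs from the network when x is input, y(x) is the desired output, and the sum is over all training inputs x: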

$
\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2.
\end{eqnarray} \ (6)
$
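
A minimal sketch of Equation (6) in plain Python, with each output vector represented as a list of floats (a representation chosen here for illustration). The sample outputs are made up purely to exercise the function.

```python
def quadratic_cost(desired_outputs, actual_outputs):
    """Quadratic cost of Equation (6): (1 / 2n) * sum over x of ||y(x) - a||^2."""
    n = len(desired_outputs)
    total = 0.0
    for y, a in zip(desired_outputs, actual_outputs):
        # Squared Euclidean distance between desired and actual output.
        total += sum((y_i - a_i) ** 2 for y_i, a_i in zip(y, a))
    return total / (2 * n)

# Two made-up training examples with two output neurons each.
desired = [[1.0, 0.0], [0.0, 1.0]]
actual = [[0.8, 0.1], [0.3, 0.6]]

cost = quadratic_cost(desired, actual)
print(cost)
```

The cost is zero only when every actual output equals its desired output, and it grows as they drift apart.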

[1]: http://yann.lecun.com/exdb/mnist/

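To see how to minimize C, suppose for now that it is a function of just two variables, $v_1$ and $v_2$. When we change $v_1$ by a small amount $\Delta v_1$ and $v_2$ by a small amount $\Delta v_2$, calculus tells us that C changes as follows: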

$
\begin{eqnarray} 
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2.
\end{eqnarray} \ (7)
$

We denote the gradient vector by ∇C, i.e.:

$
\begin{eqnarray} 
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, 
  \frac{\partial C}{\partial v_2} \right)^T.
\end{eqnarray} \ (8)
$

With these definitions, the expression (7) for ΔC can be rewritten as

$
\begin{eqnarray} 
  \Delta C \approx \nabla C \cdot \Delta v.
\end{eqnarray} \ (9)
$

In particular, suppose we choose

$
\begin{eqnarray} 
  \Delta v = -\eta \nabla C,
\end{eqnarray} \ (10)
$

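where η is a small, positive parameter (known as the `learning rate`). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$: C will decrease, never increase, as long as η is small enough for the approximation to hold.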

We'll use Equation (10) to compute a value for Δv, then move the position v (pictured as a ball rolling down into a valley of C) by that amount:

$
\begin{eqnarray}
  v \rightarrow v' = v -\eta \nabla C.
\end{eqnarray} \ (11)
$
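
The update rule (11) can be tried on a toy problem. The sketch below uses a made-up two-variable cost, $C(v) = v_1^2 + 3 v_2^2$, whose gradient is known in closed form; the cost, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
def cost(v):
    """Toy cost C(v) = v1^2 + 3 * v2^2, with its minimum at (0, 0)."""
    return v[0] ** 2 + 3.0 * v[1] ** 2

def gradient(v):
    """Gradient of the toy cost: (dC/dv1, dC/dv2)."""
    return [2.0 * v[0], 6.0 * v[1]]

eta = 0.01          # learning rate
v = [1.0, -2.0]     # arbitrary starting point

# One step: compare the actual change in C with the prediction -eta * ||grad C||^2.
grad = gradient(v)
v_next = [v_i - eta * g_i for v_i, g_i in zip(v, grad)]
predicted_change = -eta * sum(g * g for g in grad)
actual_change = cost(v_next) - cost(v)
print(predicted_change, actual_change)

# Many steps: apply the update rule v -> v - eta * grad C repeatedly.
for _ in range(1000):
    v = [v_i - eta * g_i for v_i, g_i in zip(v, gradient(v))]
print(v, cost(v))
```

The single-step prediction is only approximate because η is finite, but the cost still falls at every step and v ends up very close to the minimum.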

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing C until, we hope, we reach a global minimum.

Suppose in particular that C is a function of m variables, $v_1,\ldots,v_m$. Then the change ΔC in C produced by a small change $\Delta v = (\Delta v_1,\ldots, \Delta v_m)^T$ is

$
\begin{eqnarray} 
  \Delta C \approx \nabla C \cdot \Delta v,
\end{eqnarray} \ (12)
$

where the gradient ∇C is the vector

$
\begin{eqnarray}
  \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, 
  \frac{\partial C}{\partial v_m}\right)^T.
\end{eqnarray} \ (13)
$

Just as for the two variable case, we can choose

$
\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\end{eqnarray} \ (14)
$

and we're guaranteed that our (approximate) expression (12) for ΔC will be negative. This gives us a way of following the gradient to a minimum, even when C is a function of many variables, by repeatedly applying the update rule

$
\begin{eqnarray}
  v \rightarrow v' = v-\eta \nabla C.
\end{eqnarray} \ (15)
$

Writing out the gradient descent update rule in terms of components, we have

$
\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \ (16)\\
  b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \ (17)
\end{eqnarray}
$

An idea called `stochastic gradient descent` can be used to speed up learning. The idea is to estimate the gradient ∇C by computing $\nabla C_x$, the gradient of the cost for a single training input x, for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by picking out a small number m of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size m is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$
\begin{eqnarray}
  \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C,
\end{eqnarray} \ (18)
$

where the second sum is over the entire set of training data. Swapping sides we get

$
\begin{eqnarray}
  \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}},
\end{eqnarray} \ (19)
$
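
Estimate (19) can be illustrated with a toy problem in which the per-example cost is $C_x(v) = \frac{1}{2}(v - x)^2$, so that $\nabla C_x = v - x$. The data, the value of v, the batch size, and the random seed below are assumptions made for the sake of the example.

```python
import random

# Toy training inputs: 1000 evenly spaced numbers in [0, 1).
training_inputs = [i / 1000.0 for i in range(1000)]
v = 2.0  # current parameter value

def grad_single(v, x):
    """Gradient of the per-example cost C_x(v) = (v - x)^2 / 2."""
    return v - x

# True gradient: average of the per-example gradients over all n inputs.
full_gradient = sum(grad_single(v, x) for x in training_inputs) / len(training_inputs)

# Mini-batch estimate: average over m randomly chosen inputs.
rng = random.Random(0)  # fixed seed so the run is repeatable
m = 100
mini_batch = rng.sample(training_inputs, m)
estimate = sum(grad_single(v, x) for x in mini_batch) / m

print(full_gradient, estimate)
```

The estimate costs a tenth of the work of the full gradient here and should land close to it, which is the trade-off stochastic gradient descent exploits.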

Suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

$
\begin{eqnarray} 
  w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial w_k} \ (20)\\ 
  b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial b_l},\ (21)
\end{eqnarray} 
$

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an `epoch` of training. At that point we start over with a new training epoch.
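
The epoch and mini-batch loop described above can be sketched on a much smaller problem than digit recognition. The example below fits a single weight and bias of a linear model $a = wx + b$ using updates (20) and (21), with per-example cost $C_x = \frac{1}{2}(y - a)^2$. The model, the data, and the hyperparameters are assumptions chosen so the code stays short; it is not the implementation in `network.py`.

```python
import random

def sgd(training_data, epochs, mini_batch_size, eta, seed=0):
    """Fit a = w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(training_data)
    for _ in range(epochs):
        # Shuffle, then walk through the data one mini-batch at a time;
        # one full pass over the training inputs is one epoch.
        rng.shuffle(data)
        for start in range(0, len(data), mini_batch_size):
            mini_batch = data[start:start + mini_batch_size]
            m = len(mini_batch)
            # Sums of per-example gradients: dC_x/dw = (a - y) * x, dC_x/db = (a - y).
            grad_w = sum((w * x + b - y) * x for x, y in mini_batch)
            grad_b = sum((w * x + b - y) for x, y in mini_batch)
            # Updates (20) and (21). Both gradients were computed above,
            # before either parameter is changed.
            w -= (eta / m) * grad_w
            b -= (eta / m) * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1.
toy_data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b = sgd(toy_data, epochs=200, mini_batch_size=10, eta=0.5)
print(w, b)
```

Shuffling once per epoch and slicing consecutive chunks guarantees that every training input is used exactly once per epoch, which matches the "until we've exhausted the training inputs" description.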

## Implementing our network to classify digits

In [1]:
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

In [3]:
len(training_data), len(validation_data), len(test_data)

(50000, 10000, 10000)

In [7]:
import network
net = network.Network([784, 30, 10])

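Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η=3.0. Because test_data is supplied, each epoch reports how many of the 10,000 test images were classified correctly.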

In [8]:
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Epoch 0: 9137 / 10000
Epoch 1: 9259 / 10000
Epoch 2: 9310 / 10000
Epoch 3: 9317 / 10000
Epoch 4: 9398 / 10000
Epoch 5: 9429 / 10000
Epoch 6: 9414 / 10000
Epoch 7: 9395 / 10000
Epoch 8: 9440 / 10000
Epoch 9: 9450 / 10000
Epoch 10: 9452 / 10000
Epoch 11: 9473 / 10000
Epoch 12: 9497 / 10000
Epoch 13: 9508 / 10000
Epoch 14: 9466 / 10000
Epoch 15: 9492 / 10000
Epoch 16: 9504 / 10000
Epoch 17: 9496 / 10000
Epoch 18: 9509 / 10000
Epoch 19: 9470 / 10000
Epoch 20: 9508 / 10000
Epoch 21: 9518 / 10000
Epoch 22: 9500 / 10000
Epoch 23: 9508 / 10000
Epoch 24: 9522 / 10000
Epoch 25: 9506 / 10000
Epoch 26: 9503 / 10000
Epoch 27: 9508 / 10000
Epoch 28: 9521 / 10000
Epoch 29: 9505 / 10000
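
To read the log above as accuracies, the sketch below converts the per-epoch counts into fractions of the 10,000 test images and finds the best epoch. The counts are copied from the output above; the variable names are illustrative.

```python
# Correctly classified test images per epoch, copied from the log above.
correct_counts = [
    9137, 9259, 9310, 9317, 9398, 9429, 9414, 9395, 9440, 9450,
    9452, 9473, 9497, 9508, 9466, 9492, 9504, 9496, 9509, 9470,
    9508, 9518, 9500, 9508, 9522, 9506, 9503, 9508, 9521, 9505,
]
test_set_size = 10000

accuracies = [count / test_set_size for count in correct_counts]

# Epoch with the highest test accuracy (the first one, if there were a tie).
best_epoch = max(range(len(accuracies)), key=accuracies.__getitem__)

print(f"best epoch: {best_epoch}, accuracy {accuracies[best_epoch]:.2%}")
print(f"final epoch: {len(accuracies) - 1}, accuracy {accuracies[-1]:.2%}")
```

Accuracy climbs quickly in the first few epochs and then fluctuates by a few tenths of a percent, so the final epoch is not necessarily the best one.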


Let's rerun the above experiment, changing the number of hidden neurons to 100. 

In [9]:
net = network.Network([784, 100, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Epoch 0: 8408 / 10000
Epoch 1: 8530 / 10000
Epoch 2: 8592 / 10000
Epoch 3: 8635 / 10000
Epoch 4: 8647 / 10000
Epoch 5: 8672 / 10000
Epoch 6: 8684 / 10000
Epoch 7: 8697 / 10000
Epoch 8: 8710 / 10000
Epoch 9: 8714 / 10000
Epoch 10: 8727 / 10000
Epoch 11: 8722 / 10000
Epoch 12: 8699 / 10000
Epoch 13: 8737 / 10000
Epoch 14: 8736 / 10000
Epoch 15: 8744 / 10000
Epoch 16: 8743 / 10000
Epoch 17: 8733 / 10000
Epoch 18: 8749 / 10000
Epoch 19: 8752 / 10000
Epoch 20: 8752 / 10000
Epoch 21: 8753 / 10000
Epoch 22: 8754 / 10000
Epoch 23: 8755 / 10000
Epoch 24: 8752 / 10000
Epoch 25: 8758 / 10000
Epoch 26: 8762 / 10000
Epoch 27: 8758 / 10000
Epoch 28: 8767 / 10000
Epoch 29: 8759 / 10000
