# Using neural nets to recognize handwritten digits
http://neuralnetworksanddeeplearning.com/chap1.html

## Perceptrons
![](http://neuralnetworksanddeeplearning.com/images/tikz0.png)

$
\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
      \end{array} \right.
\end{eqnarray} \ (1)
$

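Writing $\sum_j w_j x_j$ as the dot product $w \cdot x$ and moving the threshold to the other side of the inequality as a bias, $b \equiv -\mbox{threshold}$, the perceptron rule can be rewritten: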

$
\begin{eqnarray}
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\end{eqnarray} \ (2)
$
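
As a minimal sketch, Equation (2) can be written as a small Python function. The function name `perceptron` and the parameters below (weights of -2 and -2 with a bias of 3, which make the unit behave like a NAND gate) are illustrative assumptions, not something fixed by these notes.

```python
def perceptron(x, w, b):
    """Perceptron rule in bias form: 1 if w . x + b > 0, else 0."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if z > 0 else 0


# Illustrative (assumed) parameters: two inputs, weights -2 and -2, bias 3.
weights, bias = (-2, -2), 3
nand_table = [perceptron((x1, x2), weights, bias)
              for x1 in (0, 1) for x2 in (0, 1)]
print(nand_table)
```

With these parameters only the input (1, 1) yields 0, since $-2 - 2 + 3 = -1 \leq 0$; the other three inputs give a positive weighted sum and therefore output 1.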

## Sigmoid neurons
![](http://neuralnetworksanddeeplearning.com/images/tikz8.png)

Just like a perceptron, the sigmoid neuron has a weight for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the sigmoid function and is defined by:

$
\begin{eqnarray} 
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\end{eqnarray} \ (3)
$

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias b is

$
\begin{eqnarray} 
  \frac{1}{1+\exp(-\sum_j w_j x_j-b)}.
\end{eqnarray} \ (4)
$
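
Equations (3) and (4) translate almost directly into code. The following is a sketch using only the standard library; the names `sigmoid` and `sigmoid_neuron` and the example values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Equation (3): sigma(z) = 1 / (1 + e^(-z))."""
    # Note: math.exp(-z) overflows for very negative z (below about -709);
    # this sketch does not guard against that.
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(x, w, b):
    """Equation (4): sigmoid applied to the weighted input sum plus bias."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)


# Assumed example values, chosen only for illustration.
out = sigmoid_neuron(x=(1.0, 1.0), w=(-2.0, -2.0), b=3.0)
print(out)  # sigmoid(-1), roughly 0.269
```

Unlike the perceptron, the output varies continuously between 0 and 1 as the weights and bias change.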

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-5/78296545.jpg)

This shape is a smoothed out version of a step function:

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-5/64005260.jpg)

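The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias produce a small change $\Delta \mbox{output}$ in the output of the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by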

$
\begin{eqnarray} 
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\end{eqnarray} \ (5)
$
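
Equation (5) can be checked numerically. The sketch below assumes a single sigmoid neuron and uses the standard identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$, so that $\partial\,\mbox{output}/\partial w_j = \sigma'(z)\, x_j$ and $\partial\,\mbox{output}/\partial b = \sigma'(z)$. All inputs, parameters, and perturbation sizes are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b):
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


# Assumed inputs, parameters, and small perturbations.
x = [0.5, -1.0]
w = [0.8, 0.3]
b = 0.1
dw = [0.001, -0.002]
db = 0.0005

out = neuron(x, w, b)
slope = out * (1.0 - out)  # sigma'(z) = sigma(z) * (1 - sigma(z))

# Right-hand side of Equation (5): d(output)/dw_j = sigma'(z) * x_j,
# d(output)/db = sigma'(z).
approx = sum(slope * x_j * dw_j for x_j, dw_j in zip(x, dw)) + slope * db

# Actual change after perturbing the weights and bias.
actual = neuron(x, [w_j + dw_j for w_j, dw_j in zip(w, dw)], b + db) - out

print(approx, actual)
```

The two printed numbers should agree closely; the gap between them shrinks further as the perturbations get smaller, which is what "well approximated" means here.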

## The architecture of neural networks
![](http://neuralnetworksanddeeplearning.com/images/tikz11.png)

## A simple network to classify handwritten digits
To recognize individual digits we will use a three-layer neural network:

![](http://neuralnetworksanddeeplearning.com/images/tikz12.png)

Our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons.

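The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n=15$ neurons.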
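
To make the layer sizes concrete, here is a rough sketch of a forward pass through such a network in plain Python. It assumes an output layer of 10 neurons (one per digit), sigmoid neurons throughout, Gaussian random initial weights and biases, and a random fake "image"; none of these choices are dictated by the notes above, and the names `sizes`, `weights`, `biases`, and `feedforward` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Assumed layer sizes: 784 inputs, n = 15 hidden neurons, 10 outputs.
sizes = [784, 15, 10]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One bias per neuron in every layer except the input layer.
biases = [[rng.gauss(0, 1) for _ in range(n)] for n in sizes[1:]]
# weights[l][k][j] connects neuron j in layer l to neuron k in layer l + 1.
weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def feedforward(a):
    """Propagate an input vector through every layer in turn."""
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(w_j * a_j for w_j, a_j in zip(row, a)) + b_k)
             for row, b_k in zip(W, b)]
    return a


n_params = (sum(len(b) for b in biases)
            + sum(len(row) for W in weights for row in W))
fake_image = [rng.random() for _ in range(784)]  # stand-in for pixel data
output = feedforward(fake_image)
print(n_params, len(output))
```

Under these assumptions the network has $784 \times 15 + 15 + 15 \times 10 + 10 = 11935$ adjustable parameters, which is what `n_params` counts.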

## Learning with gradient descent
The [MNIST][1] data comes in two parts. The first part contains 60,000 images to be used as training data. The second part of the MNIST data set is 10,000 images to be used as test data.

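To quantify how well the network's output approximates the desired output $y(x)$ for each training input $x$, we define a cost function. Here $w$ and $b$ denote all the weights and biases in the network, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs $x$: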

$
\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2.
\end{eqnarray} \ (6)
$
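
Equation (6) can be sketched in a few lines, assuming $a$ is the network's output vector for input $x$, $y(x)$ is the desired output vector, and $n$ is the number of training inputs. The function name `quadratic_cost` and the tiny two-example data set are made up for illustration.

```python
def quadratic_cost(outputs, targets):
    """Equation (6): C = 1/(2n) * sum over x of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((y_k - a_k) ** 2 for y_k, a_k in zip(y, a))
    return total / (2 * n)


# Two made-up training examples with two output neurons each.
outputs = [[0.0, 1.0], [0.5, 0.5]]   # what the network produced (a)
targets = [[0.0, 1.0], [1.0, 0.0]]   # what it should have produced (y(x))
print(quadratic_cost(outputs, targets))  # 0.125
```

The first example matches its target exactly and contributes nothing; the second contributes $0.5^2 + 0.5^2 = 0.5$, giving $0.5 / (2 \cdot 2) = 0.125$.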

[1]: http://yann.lecun.com/exdb/mnist/

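Suppose for now that $C$ is a function of just two variables, $v_1$ and $v_2$. Calculus tells us that when we move a small amount $\Delta v_1$ in the $v_1$ direction and a small amount $\Delta v_2$ in the $v_2$ direction, $C$ changes as follows: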

$
\begin{eqnarray} 
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2.
\end{eqnarray} \ (7)
$

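We define $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$ to be the vector of changes in $v$, and we denote the gradient vector by $\nabla C$, i.e.: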

$
\begin{eqnarray} 
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, 
  \frac{\partial C}{\partial v_2} \right)^T.
\end{eqnarray} \ (8)
$

With these definitions, the expression (7) for ΔC can be rewritten as

$
\begin{eqnarray} 
  \Delta C \approx \nabla C \cdot \Delta v.
\end{eqnarray} \ (9)
$

In particular, suppose we choose

$
\begin{eqnarray} 
  \Delta v = -\eta \nabla C,
\end{eqnarray} \ (10)
$

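where $\eta$ is a small, positive parameter (known as the `learning rate`). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will decrease and never increase, as long as $\eta$ is small enough for the approximation in Equation (9) to hold.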
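
The prediction $\Delta C \approx -\eta \|\nabla C\|^2$ is easy to test on a toy problem. The cost $C(v_1, v_2) = v_1^2 + 2 v_2^2$, the starting point, and the learning rate below are all hypothetical choices made only for this check.

```python
def cost(v1, v2):
    # Hypothetical cost function, not from the notes.
    return v1 ** 2 + 2.0 * v2 ** 2


def gradient(v1, v2):
    # Partial derivatives of the cost above.
    return (2.0 * v1, 4.0 * v2)


eta = 0.001
v = (1.0, -0.5)
g = gradient(*v)

# Equation (10): step against the gradient.
dv = (-eta * g[0], -eta * g[1])

predicted = -eta * (g[0] ** 2 + g[1] ** 2)            # -eta * ||grad C||^2
actual = cost(v[0] + dv[0], v[1] + dv[1]) - cost(*v)  # true change in C

print(predicted, actual)
```

Both numbers are negative and close to each other; making `eta` larger widens the gap, which is why $\eta$ needs to be small for the approximation to be trusted.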

We'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

$
\begin{eqnarray}
  v \rightarrow v' = v -\eta \nabla C.
\end{eqnarray} \ (11)
$

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until, we hope, we reach a global minimum.
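
Repeating the update rule (11) might look like the following sketch. The bowl-shaped cost $C(v_1, v_2) = v_1^2 + v_2^2$, the starting point, the learning rate, and the number of steps are assumptions chosen so that the behaviour is easy to follow.

```python
def cost(v1, v2):
    # Hypothetical cost with its minimum at (0, 0).
    return v1 ** 2 + v2 ** 2


def gradient(v1, v2):
    return (2.0 * v1, 2.0 * v2)


eta = 0.1          # learning rate (assumed value)
v = (3.0, -4.0)    # arbitrary starting position
history = [cost(*v)]

for _ in range(50):
    g = gradient(*v)
    # Equation (11): v -> v' = v - eta * grad C
    v = (v[0] - eta * g[0], v[1] - eta * g[1])
    history.append(cost(*v))

print(history[0], history[-1])
```

For this particular cost every step shrinks $v$ by a factor of $1 - 2\eta = 0.8$, so `history` decreases at every iteration. A cost surface with several valleys offers no such guarantee of reaching the global minimum.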

Suppose in particular that C is a function of m variables, $v_1,\ldots,v_m$. Then the change ΔC in C produced by a small change $\Delta v = (\Delta v_1,\ldots, \Delta v_m)^T$ is

$
\begin{eqnarray} 
  \Delta C \approx \nabla C \cdot \Delta v,
\end{eqnarray} \ (12)
$

where the gradient ∇C is the vector

$
\begin{eqnarray}
  \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, 
  \frac{\partial C}{\partial v_m}\right)^T.
\end{eqnarray} \ (13)
$

Just as for the two variable case, we can choose

$
\begin{eqnarray}
  \Delta v = -\eta \nabla C,
\end{eqnarray} \ (14)
$

and we're guaranteed that our (approximate) expression (12) for ΔC will be negative. This gives us a way of following the gradient to a minimum, even when C is a function of many variables, by repeatedly applying the update rule

$
\begin{eqnarray}
  v \rightarrow v' = v-\eta \nabla C.
\end{eqnarray} \ (15)
$
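
The same loop works for any number of variables. Below is a minimal sketch of Equation (15) with $v$ stored as a Python list; the helper name `gradient_descent`, the cost $C(v) = \sum_i (v_i - t_i)^2$, and the target vector `target` are hypothetical and serve only to give the routine something to minimize.

```python
def gradient_descent(grad, v, eta, steps):
    """Repeatedly apply v -> v - eta * grad(v) (Equation 15)."""
    v = list(v)
    for _ in range(steps):
        g = grad(v)
        v = [v_i - eta * g_i for v_i, g_i in zip(v, g)]
    return v


# Hypothetical cost C(v) = sum_i (v_i - t_i)^2, minimized at v = target.
target = [1.0, -2.0, 0.5, 3.0, 0.0]


def grad_C(v):
    # Partial derivative with respect to v_i is 2 * (v_i - t_i).
    return [2.0 * (v_i - t_i) for v_i, t_i in zip(v, target)]


v_min = gradient_descent(grad_C, [0.0] * len(target), eta=0.1, steps=100)
print(v_min)
```

Passing the gradient in as a function keeps the update rule separate from the particular cost, so the routine never needs to know how many variables $m$ there are.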