# Overview: neural networks achieve pattern recognition

A *neural network* is a function that takes an input (such as a greyscale image) and produces an output indicating which pattern (for example, which numeric digit between 0 and 9) it perceives in the input.

In **Structure of neural networks** below, we discuss exactly how the components of neural networks, called *neurons*, interact with each other. In **Training neural networks**, we describe how to *train* the network, gradually improving the configuration of the connections between the neurons so that the output actually achieves pattern recognition.

# Structure of neural networks

A *neural network* is a collection of layers of *neurons*. A neuron can be thought of as a configurable miniature machine that produces a number between 0 and 1, called the neuron's *activation value*. The activation values of first-layer neurons are used to store the inputs, and the activation values of last-layer neurons are used to represent the output. Specifically, each last-layer neuron is associated with a pattern that could be detected in the input; if a particular last-layer neuron has high activation while all other last-layer neurons have low activation, then the network "thinks" that the pattern associated with the highly activated neuron has been detected.

The layered structure takes advantage of the fact that patterns are made up of subpatterns. Specifically, the last layer detects which pattern is present by relying on each neuron in the prior-to-last layer to have high activation only when its associated subpattern is present. Then, the activation value of a last-layer neuron with associated pattern $p$ is computed as a weighted sum¹ of prior-to-last-layer activation values. Let us now say "pattern neuron" to mean "last-layer neuron" and "subpattern neuron" to mean "prior-to-last-layer neuron". If the weights are configured appropriately (weights on subpattern neurons whose subpatterns appear in $p$ are close to $1$, and weights on subpattern neurons whose subpatterns don't appear in $p$ are close to $0$), then² this weighted sum is close to the sum of the activations of the subpattern neurons whose subpatterns make up $p$.

Now, the prior-to-last layer, which detects subpatterns, depends on *its* previous layer to detect subsubpatterns in the same way, and so on. So we see that it's reasonable to have many layers in a neural network; that each neuron $n$ must have an associated vector (i.e. list) of weights, where each weight describes the strength of the connection from a previous-layer neuron to $n$; and that it makes sense for each neuron's activation value to be the weighted sum of activation values from the previous layer.

---
¹ *The weighted sum of numbers $a_1, ..., a_N$ by numeric weights $w_1, ..., w_N$ is defined to be $w_1 a_1 + ... + w_N a_N$. We "weight" each number by one of the weights, and compute the sum of all of the weighted numbers.*

² *To restate: if a pattern is a "sum" of subpatterns and the weights are appropriately chosen, then the activation value associated with the pattern is close to the sum of the corresponding subpattern activation values. One very pleasing mathematical way of saying this, though not quite technically true for a couple of reasons, is: "There is approximately a linear isomorphism between patterns and pattern activation values."*
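
The weighted-sum idea can be sketched in a few lines of Python. This is only a minimal illustration: the function name `weighted_sum` and the activation and weight values are made up for the example, with weights of exactly $1$ and $0$ standing in for "close to $1$" and "close to $0$".

```python
def weighted_sum(weights, values):
    """Return w_1*a_1 + ... + w_N*a_N for two equal-length lists."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * a for w, a in zip(weights, values))


# Hypothetical activations of four subpattern neurons.
subpattern_activations = [0.9, 0.8, 0.1, 0.2]

# Hypothetical weights for a pattern made up of the first two subpatterns:
# those subpatterns get weight 1, the others get weight 0.
weights = [1.0, 1.0, 0.0, 0.0]

total = weighted_sum(weights, subpattern_activations)
print(total)  # roughly 0.9 + 0.8, the sum of the two relevant activations
```

Note that the result here is larger than $1$, so a raw weighted sum cannot by itself serve as an activation value.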

A small sidenote. So far, we have concluded that the activation values of neurons should simply be weighted sums. But a weighted sum can take any value between $-\infty$ and $\infty$, while activation values must be between $0$ and $1$, so using weighted sums without modification will not work. To solve this, we give every layer a nonlinear function called an *activation function*, which is applied to each weighted sum. Perhaps activation functions should be called "normalization functions", since their purpose is to squash the results of weighted sums into the range $[0, 1]$. A very common activation function is the *sigmoid function* $\sigma:(-\infty, \infty) \rightarrow (0, 1)$ defined by $\sigma(x) := 1/(1 + e^{-x})$. You may think this function looks complicated; the important things about it are that it maps $(-\infty, \infty)$ to $(0, 1)$, that it is increasing, and that it is smooth³. Lastly, we also give each layer a number called the *bias*. The bias is added to each weighted sum in the layer before the activation function is applied, which shifts the activation values that the layer's neurons typically have.

---
³ *By "smooth" I mean "differentiable".*
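
As a rough illustration of these properties, the following Python sketch evaluates the sigmoid function at a few points and applies it to a weighted sum plus a bias. The sample inputs, the weighted sum of `1.7`, and the bias of `-1.0` are arbitrary values chosen for the example.

```python
import math


def sigmoid(x):
    """Sigmoid function 1 / (1 + e^(-x)), mapping real numbers into (0, 1).

    This simple form overflows in math.exp for very negative x; the sketch
    ignores that case.
    """
    return 1.0 / (1.0 + math.exp(-x))


inputs = [-4.0, -1.0, 0.0, 1.0, 4.0]
outputs = [sigmoid(x) for x in inputs]

for x, y in zip(inputs, outputs):
    print(f"sigmoid({x:+.1f}) = {y:.4f}")

# A weighted sum outside [0, 1], shifted by a bias and then squashed.
weighted = 1.7  # hypothetical weighted sum
bias = -1.0     # hypothetical bias
activation = sigmoid(weighted + bias)
print(activation)
```

The printed values increase with the input and all lie strictly between $0$ and $1$.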

## Everything in formulas

Here is a quick summary of *all* of the above in a couple of lines of more formal notation, for those who are familiar with such notation. Feel free to skip this section.

Define $a^{(i)}_j$ to be the activation value of the $j$th neuron in the $i$th layer. Define $\mathbf{w}^{(i)}_j$ to be the vector (i.e. the list) of weights of the $j$th neuron in the $i$th layer. Denote the $k$th entry of $\mathbf{w}^{(i)}_j$, which is the weight from the $k$th neuron in layer $i - 1$ to the $j$th neuron in layer $i$, by $w^{(i)}_{kj}$. Define $n_i$ to be the number of neurons in the $i$th layer. Finally, define $\sigma:(-\infty, \infty) \rightarrow (0, 1)$ by $\sigma(x) = 1/(1 + e^{-x})$, and let $b_i$ be the bias of the $i$th layer.

Then each activation value is given by

$$a^{(i)}_j = \sigma \left(w^{(i)}_{1j} a^{(i-1)}_1 + w^{(i)}_{2j} a^{(i-1)}_2 + ... + w^{(i)}_{n_{i-1}j} a^{(i-1)}_{n_{i-1}} + b_i \right) = \sigma \left(\sum_{k = 1}^{n_{i - 1}} w^{(i)}_{kj} a^{(i - 1)}_k + b_i \right)$$

We could stop here. There are a couple of other ways to express the above, though, if you're interested. If we define $\mathbf{a}^{(i)}$ to be the vector of activation values in the $i$th layer, $\mathbf{a}^{(i)} := (a^{(i)}_1, ..., a^{(i)}_{n_i})$, then we can use the *dot product* to make things a lot more compact. The dot product between two vectors $\mathbf{u} := (u_1, ..., u_N)$ and $\mathbf{v} := (v_1, ..., v_N)$ is defined to be $\mathbf{u} \cdot \mathbf{v} := u_1 v_1 + ... + u_N v_N = \sum_{i = 1}^N u_i v_i$. With this definition, we have

$$a^{(i)}_j = \sigma \left( \mathbf{w}^{(i)}_j \cdot \mathbf{a}^{(i - 1)} + b_i \right)$$
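
The dot-product formula for a single neuron translates almost directly into code. The sketch below is a minimal illustration; the helper names `dot` and `neuron_activation` and the numeric values are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Sigmoid function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product u_1*v_1 + ... + u_N*v_N of two equal-length lists."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui * vi for ui, vi in zip(u, v))


def neuron_activation(weights, prev_activations, bias):
    """Activation of one neuron: sigmoid(w . a_prev + b)."""
    return sigmoid(dot(weights, prev_activations) + bias)


# Hypothetical weight vector, previous-layer activations, and layer bias.
w = [1.0, 1.0, 0.0]
a_prev = [0.9, 0.8, 0.1]
b = -1.0

activation = neuron_activation(w, a_prev, b)
print(activation)  # a value strictly between 0 and 1
```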

Even this can be rewritten. If we define $\boldsymbol{\sigma}$ to be the function acting on vectors that sends $(v_1, ..., v_N)$ to $(\sigma(v_1), ..., \sigma(v_N))$, then from the above it follows that

$$\mathbf{a}^{(i)}
=
\begin{pmatrix}
  a^{(i)}_1 \\
  \vdots \\
  a^{(i)}_{n_i}
\end{pmatrix}
=
\boldsymbol{\sigma}
\left(
\begin{pmatrix}
    \mathbf{w}^{(i)}_1 \cdot \mathbf{a}^{(i - 1)} \\
    \vdots \\
    \mathbf{w}^{(i)}_{n_i} \cdot \mathbf{a}^{(i - 1)}
\end{pmatrix}
+
\underbrace{
  \begin{pmatrix}
    b_i \\
    \vdots \\
    b_i
  \end{pmatrix}
}_{\mathbf{b}^{(i)}}
\right)$$

$$=
\boldsymbol{\sigma}
\left(
\underbrace{
  \begin{pmatrix}
    (\mathbf{w}^{(i)}_1)^\top \\
    \vdots \\
    (\mathbf{w}^{(i)}_{n_i})^\top
  \end{pmatrix}
}_{(\mathbf{W}^{(i)})^\top}
\mathbf{a}^{(i - 1)}
+
\mathbf{b}^{(i)}
\right)
=
\boldsymbol{\sigma}\left((\mathbf{W}^{(i)})^\top \mathbf{a}^{(i - 1)} + \mathbf{b}^{(i)}\right)$$

In the above we have used the fact that one way of expressing a matrix-vector product is to form the vector whose $j$th entry is the dot product of the $j$th row of the matrix with the vector.
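
The matrix form suggests a simple way to compute a whole layer at once and to chain layers together. The following is a minimal sketch in plain Python, assuming each weight matrix is stored as a list of rows with `W[k][j]` playing the role of $w^{(i)}_{kj}$ and each layer having one shared bias, as in the text. The names `layer_forward` and `forward` and all numeric values are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(W, b, a_prev):
    """Compute sigma(W^T a_prev + b) entry by entry.

    W[k][j] is the weight from neuron k of the previous layer to neuron j
    of this layer; b is the single bias shared by this layer.
    """
    n_prev, n_cur = len(W), len(W[0])
    if len(a_prev) != n_prev:
        raise ValueError("a_prev must have one entry per row of W")
    return [
        sigmoid(sum(W[k][j] * a_prev[k] for k in range(n_prev)) + b)
        for j in range(n_cur)
    ]


def forward(layers, a_input):
    """Feed a_input through every (W, b) pair in turn."""
    a = a_input
    for W, b in layers:
        a = layer_forward(W, b, a)
    return a


# Hypothetical network: 3 input neurons, 2 middle neurons, 2 output neurons.
W1, b1 = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]], 0.0
W2, b2 = [[2.0, -2.0], [-2.0, 2.0]], 0.5
a0 = [1.0, 0.0, 1.0]

output = forward([(W1, b1), (W2, b2)], a0)
print(output)  # two activation values, each strictly between 0 and 1
```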

# Training neural networks

## Cost functions

Different configurations of a neural network's weights and biases result in different behaviors of the network. We measure how much error there is in the network as a function of the network's parameters (its weights and biases) with a *cost function*, which is sometimes also called a *loss function*. Strictly speaking, a cost function depends not only on the weights and biases but also on the *training data* (consisting of sample inputs and expected outputs) used to compute the error.

There are many ways to measure error and therefore many potential cost functions. One common cost function, called the *mean squared error*, or *MSE*, uses the square of the standard Euclidean norm $||\cdot||$, so that the cost $c_k$ attributable to the $k$th training example is given by $c_k(\boldsymbol{\theta}) := ||\mathbf{y}_k - \mathbf{a}^{(L)}(\boldsymbol{\theta})||^2$, where $\boldsymbol{\theta}$ is a vector (i.e. a list) of weights and biases, $\mathbf{a}^{(L)}(\boldsymbol{\theta})$ is the vector (i.e. the list) of pattern neuron (i.e. layer $L$ neuron) activations produced from the $k$th training input as a function of $\boldsymbol{\theta}$, and $\mathbf{y}_k$ is the expected output corresponding to the $k$th training input. The *average cost* that we seek to minimize is then $c(\boldsymbol{\theta}) := (1/n) \sum_{k = 1}^n c_k(\boldsymbol{\theta})$, where $n$ is the number of training examples.
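
To make the cost concrete, here is a small Python sketch that computes the per-example cost using the squared Euclidean norm, as the name mean squared error suggests, and then averages over the training examples. The function names and the sample outputs are invented for the illustration; in a real network the actual outputs would come from feeding each training input through the network.

```python
def squared_error(expected, actual):
    """Squared Euclidean distance between two equal-length lists."""
    if len(expected) != len(actual):
        raise ValueError("vectors must have the same length")
    return sum((y - a) ** 2 for y, a in zip(expected, actual))


def average_cost(expected_outputs, actual_outputs):
    """Average of the per-example squared errors."""
    costs = [
        squared_error(y, a) for y, a in zip(expected_outputs, actual_outputs)
    ]
    return sum(costs) / len(costs)


# Two hypothetical training examples with two pattern neurons each.
expected_outputs = [[1.0, 0.0], [0.0, 1.0]]
actual_outputs = [[0.5, 0.5], [0.0, 1.0]]  # made-up network outputs

cost = average_cost(expected_outputs, actual_outputs)
print(cost)  # 0.25
```

The first example contributes a cost of $0.5$ and the second, which is predicted perfectly, contributes $0$, so the average is $0.25$.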

Understanding how to minimize the cost function requires some visualization. First, think of the network's weights and biases as all being stored in a many-dimensional vector $\boldsymbol{\theta}$, which lives in the "space" of all possible such vectors. If, for example, $\boldsymbol{\theta}$ only had one weight and one bias in it, and were thus two-dimensional, then the space of all possible $\boldsymbol{\theta}$ would be a plane. Second, imagine the value of the cost function $c$ as corresponding to height above the plane. This way, as we vary $\boldsymbol{\theta}$ around the plane, the cost $c(\boldsymbol{\theta})$ goes up and down in height, and we get a *cost surface*. In practical conditions, when we have many weights and biases and are thus in many more dimensions than three, analogies to this three-dimensional example can still be helpful.

## Gradient descent

So how do we actually minimize the cost $c$? We use the *gradient descent algorithm*, which relies on the mathematical fact that if $f$ is a function (like our cost function) that sends vectors to numbers, then the negative gradient $-\nabla f$ points in the direction of greatest decrease in $f$. The idea of the gradient descent algorithm is to start at the point on the cost surface corresponding to an initial configuration of the network, slightly change the weights and biases so as to take a small step of fixed size in the direction that decreases the cost the most (i.e. the direction of $-\nabla c$), and repeat. Stop when a further fixed-size step no longer decreases the cost, as this indicates that sufficient convergence to a local minimum⁴ has been achieved.

---
⁴ *Finding a global minimum of the cost function is much more difficult.*
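
The procedure can be tried out on a toy problem. The sketch below assumes a hypothetical two-parameter cost whose gradient is known by hand, takes fixed-size steps in the direction of the negative gradient, and stops once a further step would no longer decrease the cost. The names `cost`, `gradient`, and `gradient_descent`, the step size, and the starting point are all choices made for the illustration.

```python
import math


def cost(theta):
    """Hypothetical toy cost with its minimum at theta = (3, -1)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2


def gradient(theta):
    """Gradient of the toy cost, worked out by hand."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


def gradient_descent(theta, step_size=0.01, max_steps=10_000):
    """Take fixed-size steps along -gradient until the cost stops decreasing."""
    for _ in range(max_steps):
        grad = gradient(theta)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break  # exactly at a stationary point
        # Move a distance of step_size in the direction of -gradient.
        candidate = [t - step_size * g / norm for t, g in zip(theta, grad)]
        if cost(candidate) >= cost(theta):
            break  # a further step no longer decreases the cost
        theta = candidate
    return theta


theta_min = gradient_descent([0.0, 0.0])
print(theta_min)  # close to [3, -1]
```

With fixed-size steps, the result can only be accurate to within roughly one step length of the minimum, so a smaller `step_size` trades speed for accuracy.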

## Computing the gradient with backpropagation

Of course, in order to perform gradient descent, we must be able to compute $(\nabla c)_{\boldsymbol{\theta}}$ for a given configuration $\boldsymbol{\theta}$ of weights and biases. Since the gradient of an average is the average of the gradients, the clear way to do this is to compute the gradient of each training example's cost, $(\nabla c_k)_{\boldsymbol{\theta}}$, for each $k$, and then evaluate the average $(\nabla c)_{\boldsymbol{\theta}} = (1/n) \sum_{k = 1}^n (\nabla c_k)_{\boldsymbol{\theta}}$.
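
The averaging step can be illustrated without a full network. The sketch below assumes a hypothetical two-parameter model, `theta[0] * x + theta[1]`, with a squared-error cost per example, so that each per-example gradient can be written down by hand. The model, the function names, and the data are all made up for the example.

```python
def example_gradient(theta, x, y):
    """Gradient of the per-example cost (y - (theta[0]*x + theta[1]))**2."""
    residual = y - (theta[0] * x + theta[1])
    return [-2.0 * residual * x, -2.0 * residual]


def average_gradient(theta, data):
    """Average the per-example gradients over all (x, y) training pairs."""
    n = len(data)
    grads = [example_gradient(theta, x, y) for x, y in data]
    return [sum(g[i] for g in grads) / n for i in range(len(theta))]


# Hypothetical training data lying exactly on the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

grad_at_origin = average_gradient([0.0, 0.0], data)
grad_at_minimum = average_gradient([2.0, 1.0], data)
print(grad_at_origin)   # nonzero: the cost can still be decreased
print(grad_at_minimum)  # zero in both components: a minimum of the cost
```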

The recursive *backpropagation algorithm* is used to compute the gradient $(\nabla c_k)_{\boldsymbol{\theta}}$, which, we should remember, is the gradient of a *particular* training example's cost function. Backpropagation gets its name because it computes the components of the gradient involving the $i$th layer of the network from the already-known components of the gradient for the $(i + 1)$st layer; the algorithm starts at the last layer and *propagates* the known components *back* until all are known.
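
A compact sketch of this backward pass is given below. It is one possible implementation under stated assumptions, not a definitive one: the per-example cost is the squared Euclidean error, every layer uses the sigmoid function, each layer has one shared bias as in the text, and `W[k][j]` is the weight from neuron `k` of the previous layer to neuron `j` of the current layer. The function names and the small example network are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, x):
    """Return the activations of every layer, input layer included."""
    activations = [x]
    for W, b in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(W[k][j] * prev[k] for k in range(len(prev))) + b)
            for j in range(len(W[0]))
        ])
    return activations


def cost(layers, x, y):
    """Squared Euclidean error between expected output y and the network output."""
    out = forward(layers, x)[-1]
    return sum((yj - aj) ** 2 for yj, aj in zip(y, out))


def backprop(layers, x, y):
    """Return [(dW, db), ...]: the gradient of cost, shaped like layers."""
    activations = forward(layers, x)
    out = activations[-1]
    # Last layer: derivative of the cost with respect to each neuron's
    # weighted sum (before the sigmoid), using sigmoid'(z) = a * (1 - a).
    delta = [2.0 * (a - yj) * a * (1.0 - a) for a, yj in zip(out, y)]
    grads = [None] * len(layers)
    for i in reversed(range(len(layers))):
        W, _ = layers[i]
        prev = activations[i]  # activations feeding into layer i
        dW = [[prev[k] * d for d in delta] for k in range(len(prev))]
        db = sum(delta)  # the bias is shared by all neurons of the layer
        grads[i] = (dW, db)
        if i > 0:
            # Propagate back: derivatives for the previous layer's sums.
            delta = [
                sum(W[k][j] * delta[j] for j in range(len(delta)))
                * prev[k] * (1.0 - prev[k])
                for k in range(len(prev))
            ]
    return grads


# Hypothetical network (3 inputs, 2 middle neurons, 2 outputs) and example.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], 0.1),
    ([[0.3, -0.1], [0.2, 0.5]], -0.2),
]
x, y = [0.5, 0.1, 0.9], [1.0, 0.0]

grads = backprop(layers, x, y)
print(grads[0][0])  # partial derivatives for the first layer's weights
```

One way to sanity-check such code is to nudge a single weight or bias by a tiny amount, recompute the cost, and compare the resulting slope with the corresponding entry of `grads`.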