# Multi-layer Perceptrons

* Put many neurons together and you have a neural network (also called a multi-layer perceptron, or MLP).

* Suppose you had the following neural network:

<img src="ExampleNN.pdf" width="500">


with a hard-limit activation function: $\phi(v) = \left\{ \begin{array}{c c} -1 & v \le 0 \\ 1 & v > 0 \end{array}\right.$

* *What is the output with the following input values?*
    * $\left[0, 0\right]$
    * $\left[-2, -2.5\right]$
    * $\left[-5, 5\right]$
    * $\left[10, 3\right]$


* *What does the decision surface of this network look like graphically? Draw it out by hand.*
* *Suppose you had the XOR data shown in the figure below.  Design an MLP that can correctly solve this classification problem.*


<img src="XORdata.pdf" width="300">
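To check hand-computed answers to exercises like these, a forward pass through a small hard-limit network can be sketched as follows. The weights below are hypothetical (they are not read from either figure), and the sketch assumes the XOR inputs sit at the corners of $\{0, 1\}^2$ with targets $-1$ and $+1$. The names `hard_limit` and `mlp_forward` are illustrative only.

```python
import numpy as np

def hard_limit(v):
    """Hard-limit activation from the text: -1 for v <= 0, +1 for v > 0."""
    return np.where(v > 0, 1.0, -1.0)

def mlp_forward(x, W1, b1, w2, b2):
    """Forward pass: two inputs -> one hidden layer -> one output neuron."""
    h = hard_limit(W1 @ x + b1)             # hidden-layer outputs in {-1, +1}
    return float(hard_limit(w2 @ h + b2))   # network output in {-1, +1}

# Hypothetical weights (an assumption, not taken from the figures).
W1 = np.array([[1.0, 1.0],    # hidden unit 1: +1 when x1 + x2 > 0.5 (OR-like)
               [1.0, 1.0]])   # hidden unit 2: +1 when x1 + x2 > 1.5 (AND-like)
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -1.0])    # output: +1 only when OR fires and AND does not
b2 = -1.0

inputs = [np.array([0.0, 0.0]), np.array([0.0, 1.0]),
          np.array([1.0, 0.0]), np.array([1.0, 1.0])]
outputs = [mlp_forward(x, W1, b1, w2, b2) for x in inputs]
print(outputs)  # [-1.0, 1.0, 1.0, -1.0]
```

Swapping in the weights from the example network lets you verify the four input cases listed earlier.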


## Universal Approximation Theorem

Let $\phi(\cdot)$ be a non-constant, bounded and monotone-increasing continuous function.  Let $I_{m_0}$ denote the $m_0$-dimensional unit hypercube $[0, 1]^{m_0}$.  The space of continuous functions on $I_{m_0}$ is denoted by $C(I_{m_0})$.  Then, given any function $f \in C(I_{m_0})$ and $\epsilon > 0$, there exists an integer $m_1$ and sets of real constants $\alpha_i, b_i,$ and $w_{ij}$, where $i = 1, \ldots, m_1$ and $j = 1, \ldots, m_0$ such that we may define
\begin{equation}
F(x_1, \ldots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i \phi\left( \sum_{j=1}^{m_0} w_{ij}x_j + b_i\right)
\end{equation}
as an approximate realization of the function $f(\cdot)$: that is, 
\begin{equation}
\left| F(x_1, \ldots, x_{m_0}) - f(x_1, \ldots, x_{m_0}) \right| < \epsilon
\end{equation}
for all $x_1, x_2, \ldots, x_{m_0}$ that lie in the input space.

* Essentially, the Universal Approximation Theorem states that a single hidden layer is sufficient for a multilayer perceptron to compute a uniform $\epsilon$ approximation to a given training set, provided you have the *right* number of neurons and the *right* activation function.  (However, it is an existence result: it does not say how to find the weights, nor that a single hidden layer is optimal with regard to learning time, generalization, etc.)
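The function $F$ in the theorem is simply a single-hidden-layer network with a linear output. A minimal sketch of evaluating it is given below, assuming the logistic sigmoid for $\phi$ (one function that is non-constant, bounded, monotone-increasing and continuous). The random parameter values are placeholders for illustration, not fitted to any $f$.

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid, used here as one admissible choice of phi."""
    return 1.0 / (1.0 + np.exp(-v))

def F(x, alpha, W, b):
    """Evaluate F(x) = sum_i alpha_i * phi(sum_j w_ij * x_j + b_i).

    x: (m0,) input, alpha: (m1,), W: (m1, m0), b: (m1,)
    """
    return alpha @ sigmoid(W @ x + b)

# Placeholder parameters (assumed for illustration only).
rng = np.random.default_rng(0)
m0, m1 = 2, 5
alpha = rng.normal(size=m1)
W = rng.normal(size=(m1, m0))
b = rng.normal(size=m1)

x = np.array([0.25, 0.75])   # a point in the unit hypercube [0, 1]^2
print(F(x, alpha, W, b))
```

The theorem guarantees that suitable `alpha`, `W`, `b` and `m1` exist for any continuous target; it gives no procedure for finding them.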


## Background for Error Back-Propagation

* Error Back-Propagation is based on *gradient descent*.
* Let's review/learn gradient descent:

*Method of Gradient/Steepest Descent:*

* Move in the direction opposite to the gradient vector, $g = \nabla E(\mathbf{w})$:
\begin{eqnarray}
w(n+1) &=& w(n) - \eta g(n)\\
\Delta w(n) &=& w(n+1) - w(n)\\
\Delta w(n) &=& - \eta g(n) \quad \text{ Error correction rule }
\end{eqnarray}
* Show that using steepest descent, $E(\mathbf{w}(n+1)) < E(\mathbf{w}(n)) $
*  Recall: Taylor Series Expansion: $f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + ....$
* Approximate $E(\mathbf{w}(n+1))$ with Taylor Series around $w(n)$
\begin{eqnarray}
E(w(n+1)) &\approx& E(w(n)) + \nabla E(w(n))^T (w(n+1) - w(n))\\
&\approx& E(w(n)) + g^T(n)(\Delta w(n))\\
&\approx& E(w(n)) - \eta g^T(n)g(n)\\
&\approx& E(w(n)) - \eta \left\| g(n) \right\|^2
\end{eqnarray}
* For a small positive $\eta$ and a nonzero gradient, this first-order approximation shows that the cost function decreases at each step.
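The descent property can be checked numerically. The sketch below runs steepest descent on a made-up quadratic cost $E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T A \mathbf{w}$; the matrix `A`, the starting point and the step size are assumptions chosen only for illustration.

```python
import numpy as np

# Made-up quadratic cost E(w) = 0.5 * w^T A w, with gradient g = A w.
A = np.diag([1.0, 10.0])

def E(w):
    return 0.5 * w @ A @ w

def grad_E(w):
    return A @ w

eta = 0.05                    # small positive learning rate
w = np.array([2.0, -1.0])     # arbitrary starting point
errors = [E(w)]

for n in range(20):
    g = grad_E(w)             # g(n)
    w = w - eta * g           # w(n+1) = w(n) - eta * g(n)
    errors.append(E(w))

# Each step should lower the cost when eta is small enough.
print(all(later < earlier for earlier, later in zip(errors, errors[1:])))
```

In this particular example a step size above 0.2 makes the second coordinate overshoot, so the first-order argument no longer applies and the cost can grow.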


## Error Back-propagation

* There are many approaches to train a neural network.  
* One of the most commonly used is the *Error Back-Propagation Algorithm*.


* Two kinds of signals:
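1. Function signals: input signals that propagate forward through the network and emerge at the output; presumed to perform a useful function at the output of the network
2. Error signals: originate at an output neuron and propagate backward through the network; their computation involves an error-dependent function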

* Each hidden or output neuron performs two computations:
1. Computation of function signal going out of this neuron
2.  Computation of an estimate of the gradient vector


* First let's consider the output layer...

* Given a training set, $\left\{ \mathbf{x}_n, d_n\right\}_{n = 1}^N$, we want to find the parameters of our network that minimize the squared error: 
 \begin{equation}
 E(w) = \frac{1}{2} \sum_{n=1}^N (d_n - y_n)^2
 \end{equation}

* What is a common optimization approach to estimate the parameters that minimize an objective/error function? *gradient descent*
* To use gradient descent, what do we need?  The analytic form of the gradient. 

 \begin{eqnarray}
 \frac{ \partial E}{\partial w_i} &=& \frac{\partial}{\partial w_i} \left[ \frac{1}{2} \sum_{n=1}^N (d_n - y_n)^2 \right]\\
 &=&  \frac{1}{2} \sum_{n=1}^N  \frac{\partial}{\partial w_i}  (d_n - y_n)^2 \\
 &=&  \frac{1}{2} \sum_{n=1}^N   2(d_n - y_n) \frac{\partial}{\partial w_i} (d_n - y_n) \\
  &=&  \sum_{n=1}^N   (d_n - y_n) \left( \frac{\partial}{\partial w_i} d_n -  \frac{\partial}{\partial w_i} y_n \right) \\
    &=&  \sum_{n=1}^N   (d_n - y_n) \left(  -  \frac{\partial }{\partial w_i} y_n \right) 
 \end{eqnarray}

* What is $y_n$ in terms of $w_i$?  (At first let's assume we have no hidden layers, only the output layer to deal with)

\begin{equation}
y_n = \phi(v_n) = \phi(\mathbf{w}^T \mathbf{x}_n) 
\end{equation} 

* Going back to computing our gradient... 
 \begin{eqnarray}
 \frac{ \partial E}{\partial w_i}   &=&  \sum_{n=1}^N   (d_n - y_n) \left(  -  \frac{\partial }{\partial w_i} y_n \right) \\
    &=&  \sum_{n=1}^N  - (d_n - y_n)   \frac{\partial y_n}{\partial v_n} \frac{\partial v_n}{\partial w_i} 
 \end{eqnarray}

* So, $\frac{\partial y_n}{\partial v_n}$ depends on the form of the activation function we use.  If we use the sigmoid: $y_n = \frac{1}{1 + \exp(-\alpha v_n)}$, then *what is* $\frac{\partial y_n}{\partial v_n}$?

 \begin{eqnarray}
 \frac{\partial y_n}{\partial v_n} &=& \frac{\partial \phi(v_n)}{\partial v_n}\\
  &=& \frac{ \partial }{\partial v_n}  \frac{1}{1 + \exp(-\alpha v_n)} \\
  &=& \frac{  \left(1 + \exp(- \alpha v_n)\right)\left(\frac{ \partial }{\partial v_n} 1\right) - \left(1\right)\left( \frac{ \partial }{\partial v_n}  \left(1 + \exp(- \alpha v_n)\right) \right)}{(1 + \exp(-\alpha v_n))^2}\\
    &=& \frac{  - \frac{ \partial }{\partial v_n}  (1 + \exp(- \alpha v_n) ) }{(1 + \exp(-\alpha v_n))^2}\\
    &=& \frac{  -1  }{(1 + \exp(-\alpha v_n))^2} \exp(-\alpha v_n)(-\alpha)\\
    &=& \alpha \frac{  1  }{1 + \exp(-\alpha v_n)} \frac{  1  }{1 + \exp(- \alpha v_n)} \exp(-\alpha v_n)\\
        &=& \alpha \frac{  1  }{1 + \exp(-\alpha v_n)} \frac{  \exp(-\alpha v_n)  }{1 + \exp(-\alpha v_n)}\\
        &=& \alpha y_n (1-y_n)
 \end{eqnarray}

* For the common choice $\alpha = 1$, this reduces to $y_n (1-y_n)$.

* Going back to computing our gradient... 
 \begin{eqnarray}
 \frac{ \partial E}{\partial w_i}   &=&  \sum_{n=1}^N  - (d_n - y_n)   \frac{\partial y_n}{\partial v_n} \frac{\partial v_n}{\partial w_i} \\
     &=&  \sum_{n=1}^N  - (d_n - y_n)   \alpha y_n (1-y_n) \frac{\partial v_n}{\partial w_i}\\
     &=&  \sum_{n=1}^N  - (d_n - y_n)   \alpha y_n (1-y_n) \frac{\partial }{\partial w_i} \mathbf{w}^T \mathbf{x}_n\\
     &=&  \sum_{n=1}^N  - (d_n - y_n)   \alpha y_n (1-y_n) x_{ni}
 \end{eqnarray}
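A finite-difference check is a convenient way to confirm a hand-derived gradient. The sketch below does this for a single sigmoid neuron with no hidden layer, on random made-up data. Note that for a general slope parameter $\alpha$ the chain rule on $\exp(-\alpha v_n)$ contributes a factor $\alpha$ to the sigmoid derivative; with $\alpha = 1$ it is simply $y_n(1 - y_n)$. The function names are illustrative.

```python
import numpy as np

def sigmoid(v, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * v))

def squared_error(w, X, d, alpha=1.0):
    """E(w) = 0.5 * sum_n (d_n - y_n)^2 with y_n = phi(w^T x_n)."""
    y = sigmoid(X @ w, alpha)
    return 0.5 * np.sum((d - y) ** 2)

def gradient(w, X, d, alpha=1.0):
    """dE/dw_i = sum_n -(d_n - y_n) * alpha * y_n * (1 - y_n) * x_ni."""
    y = sigmoid(X @ w, alpha)
    delta = -(d - y) * alpha * y * (1.0 - y)   # one factor per sample n
    return X.T @ delta                          # sums over n for every i

# Random made-up data: 6 samples, 3 inputs (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
d = rng.integers(0, 2, size=6).astype(float)
w = rng.normal(size=3)
alpha = 2.0

analytic = gradient(w, X, d, alpha)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.array([
    (squared_error(w + eps * e, X, d, alpha)
     - squared_error(w - eps * e, X, d, alpha)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```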

* *Now that we have the gradient, how do we use this to update the output layer weights in our MLP?*
* *How will this update equation  (for the output layer) change if the network is a multilayer perceptron with hidden units?*
* *Can you write this in vector form to update all weights simultaneously?*
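One possible answer for the no-hidden-layer case is sketched below: plug the gradient into the steepest-descent rule, $\mathbf{w}(n+1) = \mathbf{w}(n) - \eta \nabla E(\mathbf{w}(n))$, and compute all components at once with a matrix product. The toy data (logical AND with a constant bias input), the learning rate, the epoch count and the choice $\alpha = 1$ are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    # Logistic sigmoid with slope alpha = 1 (an assumption of this sketch).
    return 1.0 / (1.0 + np.exp(-v))

# Toy data (made up): logical AND; the first column is a constant bias input.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
d = np.array([0.0, 0.0, 0.0, 1.0])

eta = 0.5           # learning rate
w = np.zeros(3)     # initial weights
errors = []

for epoch in range(2000):
    y = sigmoid(X @ w)                          # outputs for all samples
    errors.append(0.5 * np.sum((d - y) ** 2))   # squared error E(w)
    grad = -X.T @ ((d - y) * y * (1.0 - y))     # gradient in vector form
    w = w - eta * grad                          # update all weights at once

print(errors[0], errors[-1])
print(sigmoid(X @ w) > 0.5)
```

With hidden units, the output-layer update keeps this form, except that the neuron's inputs are the hidden-layer outputs rather than the raw $\mathbf{x}_n$.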

