# Neural Networks: Non-Linear Hypothesis

## Motivations

### Non-Linear Hypothesis
* Adding polynomial terms to fit complex shapes works well when you have 2 features. It does not scale to many features. For example, n = 100 gives roughly 5000 terms when only second-order terms are considered.
* Image analysis needs nonlinear hypotheses.
* A large n leads to too many features.
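
To make the growth concrete, the second-order terms (all products $x_i x_j$ with $i \le j$) can be counted directly. This is a minimal sketch; the function name `count_quadratic_terms` is illustrative and not from the original notes.

```python
from itertools import combinations_with_replacement


def count_quadratic_terms(n: int) -> int:
    """Count distinct second-order terms x_i * x_j (i <= j) for n features."""
    return sum(1 for _ in combinations_with_replacement(range(n), 2))


# The closed form is n * (n + 1) / 2, which grows as O(n^2).
print(count_quadratic_terms(100))  # 5050, i.e. roughly 5000
```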



### Neurons and the Brain
* "One learning algorithm" hypothesis: the brain may use a single learning algorithm for many different tasks, an idea supported by neuro-rewiring experiments.


## Neural Networks

### Model Representation
* Modeled off networked neurons in the brain.
* Neuron modeled as a logistic unit.
* Output computed as 
$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
* May be influenced by a bias unit, $x_0 = 1$
* Neural network example: Layer 1 ~ Input layer, Layer 2 ~ Hidden layer, Layer 3 ~ output layer.
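
A single logistic unit can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `sigmoid` and `logistic_unit` and the example weights are hypothetical, and the bias unit $x_0 = 1$ is prepended inside the function.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(theta: Sequence[float], x: Sequence[float]) -> float:
    """Output of one neuron: g(theta^T x), with the bias x_0 = 1 prepended."""
    x_with_bias = [1.0] + list(x)
    z = sum(t * xi for t, xi in zip(theta, x_with_bias))
    return sigmoid(z)


# Hypothetical weights; theta[0] multiplies the bias unit.
print(logistic_unit([0.0, 1.0, -1.0], [2.0, 2.0]))  # z = 0, so the output is 0.5
```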

Notation
* $i$ ~ index of a unit within a layer
* $j$ ~ index of a layer
* $s_j$ ~ number of activation nodes (units) in layer $j$, not counting the bias unit
* $a_i^{(j)}$ ~ "activation" of unit $i$ in layer $j$
* $\Theta^{(j)}$ ~ matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
* $g$ ~ sigmoid, $g(z) = \frac{1}{1+e^{-z}}$

Example walk through

\begin{align*} 
&a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \newline 
&a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \newline 
&a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \newline 
h_\Theta(x) = &a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \newline 
\end{align*}

If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$; the $+1$ accounts for the bias unit. In this simple example, at $j = 2$ we have $s_2 = 3$ and $s_3 = 1$, so $\Theta^{(2)}$ is $1 \times 4$.
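
The dimension rule can be checked mechanically for the example network. This is a small sketch; `theta_shape` and `layer_sizes` are illustrative names, and the sizes exclude bias units.

```python
from typing import Tuple


def theta_shape(s_j: int, s_next: int) -> Tuple[int, int]:
    """Shape of Theta^(j): one row per unit in layer j+1,
    one column per unit in layer j plus one for the bias unit."""
    return (s_next, s_j + 1)


# Example network: 3 inputs, 3 hidden units, 1 output (bias units not counted).
layer_sizes = [3, 3, 1]
shapes = [theta_shape(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(shapes)  # [(3, 4), (1, 4)]
```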

Rewriting $a_1^{(2)}$ as $a_1^{(2)} = g(z_1^{(2)})$, where $z_1^{(2)}$ is the weighted sum inside $g$, allows a vectorized implementation:

$z^{(2)} = \Theta^{(1)} a^{(1)}$ and $a^{(2)} = g(z^{(2)})$

Next, add the bias unit $a_0^{(2)} = 1$, so that $a^{(2)}$ becomes a 4-dimensional vector.

$z^{(3)} = \Theta^{(2)} a^{(2)}$

$h_\Theta = a^{(3)} = g(z^{(3)})$

Process of computing $h(x)$ is called forward propagation.

\begin{align*}a_1^{(2)} = g(z_1^{(2)}) \newline a_2^{(2)} = g(z_2^{(2)}) \newline a_3^{(2)} = g(z_3^{(2)}) \newline \end{align*}

In other words, for layer $j = 2$ and node $k$, the variable $z$ will be:

$z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n$

The vector representation of $x$ and $z^{(j)}$ is:

\begin{align*}x = \begin{bmatrix}x_0 \newline x_1 \newline\vdots \newline x_n\end{bmatrix} &z^{(j)} = \begin{bmatrix}z_1^{(j)} \newline z_2^{(j)} \newline\vdots \newline z_{s_j}^{(j)}\end{bmatrix}\end{align*}

Setting $x = a^{(1)}$, rewrite:

$z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$

now:

$a^{(j)} = g(z^{(j)})$, where the function $g$ is applied element-wise to the vector $z^{(j)}$.

Next, add bias unit (equal to 1) to layer $j$ after computing $a^{(j)}$.  This is element $a_0^{(j)} = 1$.

Finally:

$z^{(j+1)} = \Theta^{(j)} a^{(j)}$

The last $\Theta$ matrix has only **one row**, which is multiplied by the **single column** $a^{(j)}$, so the product is a single number.

Final result calculated with:

$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$

This final step is the same computation performed in logistic regression.

A neural network is similar to logistic regression, except that the input to the final step is the hidden-layer activations, which are themselves computed from the features in the previous layer. This lets the network learn its own useful features instead of being limited to the raw input features.
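
The forward propagation steps above can be sketched as a loop over the weight matrices. This is a minimal pure-Python illustration, not a reference implementation: `forward_propagate` and the all-zero weights are assumptions made for the example, and each $\Theta^{(j)}$ is stored as a list of rows.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_propagate(thetas: Sequence[Matrix], x: Sequence[float]) -> List[float]:
    """Compute h_Theta(x) by repeating z = Theta * a, a = g(z) for each layer."""
    a = list(x)  # a^(1) = x, without the bias unit
    for theta in thetas:
        a_with_bias = [1.0] + a  # add the bias unit a_0 = 1
        z = [sum(w * v for w, v in zip(row, a_with_bias)) for row in theta]
        a = [sigmoid(z_k) for z_k in z]  # g applied element-wise
    return a


# Hypothetical 3-3-1 network with all weights zero: every z is 0, so g(z) = 0.5.
theta1 = [[0.0] * 4 for _ in range(3)]  # 3 x 4
theta2 = [[0.0] * 4]                    # 1 x 4
print(forward_propagate([theta1, theta2], [1.0, 2.0, 3.0]))  # [0.5]
```

The bias unit is added inside the loop, after the previous layer's activations are computed, which mirrors the order of the steps in the notes.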

## Applications


### Examples and Intuitions

#### Simple AND example

Can a single unit compute the logical AND of two binary inputs? Take:

$\Theta^{(1)} = [-30, 20, 20]$, $x_0 = 1$

-> 

$h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)$

Sigmoid landmarks: $g(4.6) = 0.99$, $g(-4.6) = 0.01$

For $x_1 = [0, 0, 1, 1]$ and $x_2 = [0, 1, 0, 1]$

$h_\Theta(x) = [g(-30), g(-10), g(-10), g(10)] \approx [0, 0, 0, 1]$

$h_\Theta(x) \approx x_1$ AND $x_2$

An example of $x_1$ OR $x_2$ uses $\Theta^{(1)} = [-10, 20, 20]$

An example of NOT $x_1$ uses $\Theta = [10, -20]$

An example of (NOT $x_1$) AND (NOT $x_2$) uses $\Theta^{(1)} = [10, -20, -20]$
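
These weight choices can be checked by evaluating a single sigmoid unit on every input combination. The sketch below uses the weights from the notes; the helper name `gate` and the rounding of the output to 0 or 1 are assumptions made for readability.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def gate(theta: Sequence[float], *inputs: int) -> int:
    """Single unit: theta[0] is the bias weight; the output is rounded to 0 or 1."""
    z = theta[0] + sum(w * x for w, x in zip(theta[1:], inputs))
    return round(sigmoid(z))


AND = [-30, 20, 20]
OR = [-10, 20, 20]
NOT = [10, -20]
NOT_AND_NOT = [10, -20, -20]  # (NOT x1) AND (NOT x2)

# Print the truth table for the two-input gates.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, gate(AND, x1, x2), gate(OR, x1, x2), gate(NOT_AND_NOT, x1, x2))
```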

#### Classification : XNOR

* XNOR ~ NOT ($x_1$ XOR $x_2$): the output is 1 when $x_1$ and $x_2$ are equal.

Putting the above together with an input layer, one hidden layer, and an output layer: $a_1^{(2)} =$ $x_1$ AND $x_2$, $a_2^{(2)} =$ (NOT $x_1$) AND (NOT $x_2$), and $a_1^{(3)} =$ $a_1^{(2)}$ OR $a_2^{(2)}$. This yields $h_\Theta(x) \approx x_1$ XNOR $x_2$.

This illustrates that relatively simple functions can be layered to generate complex results.
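
The layered construction can be written out as a short sketch. It reuses the AND, (NOT $x_1$) AND (NOT $x_2$), and OR weights from the examples above, with the output unit applying OR to the two hidden activations; the function name `xnor` is illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def xnor(x1: int, x2: int) -> float:
    """Two-layer network approximating x1 XNOR x2."""
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)  # hidden unit 1: x1 AND x2
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)   # hidden unit 2: (NOT x1) AND (NOT x2)
    return sigmoid(-10 + 20 * a1 + 20 * a2)  # output unit: a1 OR a2


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

Rounded to 0 or 1, the output is 1 exactly when the two inputs are equal.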

### Multiclass Classification

The output determines which class the image belongs to. The training set must contain examples of each class, and the network has one output unit per class. Previously $y \in \{1, 2, 3, 4\}$; now each label $y^{(i)}$ in a training pair $(x^{(i)},y^{(i)})$ is one of the vectors $[1;0;0;0] , [0;1;0;0], [0;0;1;0], [0;0;0;1]$, and the image is $x^{(i)}$.
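
The label encoding can be sketched as follows. The names `one_hot` and `predict_class` are hypothetical, and choosing the class with the largest output is an assumption about how a prediction would be read off, not something stated in the notes.

```python
from typing import List, Sequence


def one_hot(label: int, num_classes: int) -> List[int]:
    """Encode a label in 1..num_classes as a vector with a single 1."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must be between 1 and num_classes")
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


def predict_class(h: Sequence[float]) -> int:
    """Assumed decision rule: pick the output unit with the largest activation."""
    return max(range(len(h)), key=lambda k: h[k]) + 1


print(one_hot(3, 4))                          # [0, 0, 1, 0]
print(predict_class([0.1, 0.2, 0.9, 0.05]))   # 3
```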

exp:
1. Any logical function can be represented by a neural network; XOR needs 3 layers; hidden activations computed by the sigmoid lie between 0 and 1; $a_1 + a_2 + a_3$ does not always equal 1
2. NAND [30,-20,-20]
3. 
4. `z = Theta1 * x; a2 = sig(z);` is correct; `a2 = sig(x * Theta1)` is wrong
5. Does not stay the same