
## Resources
* [I am trask blog - simple introduction to NN](http://iamtrask.github.io/)
* [Neural network tutorial](http://www.existor.com/en/news-neural-networks.html) - worth walking through all the way
* [Andrew Ng ML course](https://www.coursera.org/learn/machine-learning/)
* [Pedro Domingos course](https://www.coursera.org/course/machlearning)

In [2]:
import numpy as np
import matplotlib 

## Work of a single neuron on a single example

Consider a simple network with an input unit whose nodes are marked $x_i$, an output unit whose nodes are marked $y_i$, and a hidden unit whose nodes are marked $h_i$. The value of a specific node, say $h_1$, is evaluated from the values of all input nodes:
> $h_1 = a\left(\sum_{i=1}^{3} W^{(1)}_{i1} x_i\right)$

where $a$ is an **activation function** and $W^{(1)}_{i1}$ is the weight connecting input node $x_i$ to $h_1$. That is, we first compute a scalar product between the weights and the values of the input nodes, then use the computed value as the argument of the activation function.


In [26]:
def sigmoid(x):
    return 1./(1.+np.exp(-x))

def rectifier(x):
    # element-wise max(x, 0)
    return np.maximum(x, 0.0)
    

x     = np.array([1.0,0,0])
w     = np.array([0.2,-0.03,0.14])
print ' Scalar product between unit and weights ',x.dot(w)
print ' Values of Sigmoid activation function   ',sigmoid(x.dot(w))
print ' Values of tanh    activation function   ',np.tanh(x.dot(w))


 Scalar product between unit and weights  0.2
 Values of Sigmoid activation function    0.549833997312
 Values of tanh    activation function    0.197375320225


### Examples of activation functions

Examples of some popular [activation functions](https://en.wikipedia.org/wiki/Activation_function):
 * [Rectifier](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))

In [25]:
import pylab

x  = np.linspace(-2,2,100) # 100 linearly spaced numbers
s  = sigmoid(x)    # values of the sigmoid
th = np.tanh(x)    # values of tanh
re = rectifier(x)  # values of the rectifier

# compose plot
pylab.plot(x,s)                        # sigmoid as a line
pylab.plot(x,s,'co',label='Sigmoid')   # sigmoid again, as cyan circles
pylab.plot(x,th,label='tanh')          # tanh
pylab.plot(x,re,label='rectifier')     # rectifier
pylab.legend()
pylab.show() # show the plot

### Matrix notations

It is convenient to arrange the weights in a matrix $\mathbf{W}^{(1)}$, whose entry $W^{(1)}_{ij}$ connects input node $x_i$ to hidden node $h_j$, so that all nodes can be computed at once:

> $\mathbf{h} = a({\mathbf{W}^{(1)}}^{\top} \mathbf{x})$

The output values are computed similarly, by multiplying the values of $\mathbf{h}$ by another weight matrix:

> $\mathbf{y} = {\mathbf{W}^{(2)}}^{\top} \mathbf{h}$

We do not apply an activation function this time, so the output unit acts as a **linear regressor**. The values of $y$ are arbitrary real numbers. For classification problems we want to convert them into probabilities. This is achieved by using the [softmax](https://en.wikipedia.org/wiki/Softmax_function) function:

> $s_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$

The element $s_i$ is the probability that the label of the output is $i$. This is the same expression used by multinomial [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) for classification with more than two labels.
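
The softmax formula can be tried out on its own. The sketch below is a minimal version that subtracts $\max(y)$ before exponentiating, a common safeguard against overflow that leaves the result unchanged; that shift and the name `softmax_stable` are assumptions added here, not part of the notebook's own `softmax`. The input values are the output values printed by the feed-forward cell, rounded to 8 digits.

```python
import numpy as np

def softmax_stable(y):
    # Shifting by max(y) does not change the result but keeps exp() from overflowing
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

y = np.array([0.02967856, -0.00162144, -0.04660182])
s = softmax_stable(y)

# Adding a constant to every input leaves the probabilities unchanged
s_shifted = softmax_stable(y + 1000.0)

print("softmax       : {}".format(s))
print("sum           : {}".format(np.sum(s)))
print("shifted input : {}".format(s_shifted))
```

The entries are positive and sum to 1, which is what lets us read them as probabilities.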

### Feed forward cycle for a simple network with an input unit, one hidden unit and an output unit


In [40]:
def softmax(y):
    s = np.sum(np.exp(y))
    return np.exp(y)/s

x  = np.array([1.,0,0])
W1 = np.array([[0.2,0.15,-0.01],[0.01,-0.1,-0.06],[0.14,-0.2,-0.03]])
xw = W1.T.dot(x)
h1 = np.tanh(xw)

W2 = np.array([[0.08,0.11,-0.3],[0.1,-0.15,0.08],[0.1,0.1,-0.07]])
y  = W2.T.dot(h1)
s  = softmax(y)

print ' Value after scalar product (transfer function) ',xw
print ' and after activation function                  ',np.tanh(xw)
print ' output value                                   ',y
print ' softmax                                        ',s

print '\n'
for i in [0,1,2]:
    print 'The probability for classifying to label ',i,' is ',s[i]

 Value after scalar product (transfer function)  [ 0.2   0.15 -0.01]
 and after activation function                   [ 0.19737532  0.14888503 -0.00999967]
 output value                                    [ 0.02967856 -0.00162144 -0.04660182]
 softmax                                         [ 0.34533473  0.33469317  0.3199721 ]


The probability for classifying to label  0  is  0.345334733084
The probability for classifying to label  1  is  0.334693165649
The probability for classifying to label  2  is  0.319972101267
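
The feed-forward cell above computes every hidden node at once with `W1.T.dot(x)`. As a sanity check, the sketch below compares that with the node-by-node sum $h_j = a(\sum_i W^{(1)}_{ij} x_i)$, reusing the same `x` and `W1`; `h_loop` and `h_matrix` are illustrative names only.

```python
import numpy as np

x  = np.array([1., 0., 0.])
W1 = np.array([[0.2, 0.15, -0.01],
               [0.01, -0.1, -0.06],
               [0.14, -0.2, -0.03]])

# Node by node: h_j = tanh(sum_i x_i * W1[i, j])
h_loop = np.array([np.tanh(sum(x[i] * W1[i, j] for i in range(3)))
                   for j in range(3)])

# All nodes at once: after the transpose, row j holds the weights into node j
h_matrix = np.tanh(W1.T.dot(x))

print("loop   : {}".format(h_loop))
print("matrix : {}".format(h_matrix))
```

Both lines should print the same vector as the "after activation function" output above.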


### Computing errors
Suppose that the expected output is $y^* = (0,1,0)$. We can now compute the error vector $e=s-y^*$ and use it to evaluate a loss function. Popular loss functions are:
* Absolute loss $\sum_i |e_i|$
* Square loss $\sum_i e_i^2$
* Cross entropy loss $-\sum_i y_i^*\log{s_i}$. The rationale here is that the output of the softmax function is a probability distribution, and we can also view the true label vector $y^*$ as a probability distribution (1 for the correct label and 0 for all other labels). Cross entropy is a common way to measure the difference between two distributions.

In [46]:
def abs_loss(e):
    return np.sum(np.abs(e))

def sqr_loss(e):
    return np.sum(e**2)

def cross_entropy_loss(y_estimated,y_real):
    return -np.sum(y_real*np.log(y_estimated))

ystar = np.array([0.,1.,0])
err   = s - ystar

print ' Error                                          ',err
print ' Absolute loss                                  ',abs_loss(err)
print ' Square loss                                    ',sqr_loss(err)
print ' Cross entropy loss                             ',cross_entropy_loss(s,ystar)



 Error                                           [ 0.34533473 -0.66530683  0.3199721 ]
 Absolute loss                                   1.3306136687
 Square loss                                     0.664271407297
 Cross entropy loss                              1.09454109031
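
Because $y^*$ is one-hot, only the term of the correct label survives in the cross entropy sum, so the loss reduces to $-\log s_k$ for the correct label $k$. The sketch below checks this with the softmax values printed earlier, copied here rounded to 8 digits, so the result agrees with the cell output only approximately; `full` and `short` are illustrative names.

```python
import numpy as np

s     = np.array([0.34533473, 0.33469317, 0.3199721])  # softmax output from above (rounded)
ystar = np.array([0., 1., 0.])                          # one-hot target

# Full sum over all labels
full = -np.sum(ystar * np.log(s))

# Only the entry of the correct label contributes
k = np.argmax(ystar)
short = -np.log(s[k])

print("full sum      : {}".format(full))
print("-log(s[k])    : {}".format(short))
```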


## Work of a single neuron on 4 examples

In [57]:
X     = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y     = np.array([[0,1,1,0]]).T

np.random.seed(1)
w0 = np.random.random((3,1))

print 'The data matrix (4 examples in 3 dimensions)\n',X
print '\n The target vector y\n',y
print '\nRandom weights \n',w0
print '\nValues of activation function\n',sigmoid(np.dot(X,w0))

The data matrix (4 examples in 3 dimensions)
[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]

 The target vector y
[[0]
 [1]
 [1]
 [0]]

Random weights 
[[  4.17022005e-01]
 [  7.20324493e-01]
 [  1.14374817e-04]]

Values of activation function
[[ 0.50002859]
 [ 0.67270365]
 [ 0.60279781]
 [ 0.75721315]]


## Backpropagation 

In [63]:
X     = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y     = np.array([[0,0,1,1]]).T

np.random.seed(1)
w0 = 2.0*np.random.random((3,1))-1.

for iter in xrange(10000):

    # forward propagation
    l0 = X
    l1 = sigmoid(np.dot(l0,w0))

    # how much did we miss?
    l1_error = y - l1

    # multiply how much we missed by the 
    # slope of the sigmoid at the values in l1;
    # since l1 = sigmoid(z), that slope is l1*(1-l1)
    l1_delta = l1_error * l1 * (1.0 - l1)

    # update weights
    w0 += np.dot(l0.T,l1_delta)

print "Output After Training:"
print l1


Output After Training:
[[ 0.00966449]
 [ 0.00786506]
 [ 0.99358898]
 [ 0.99211957]]
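
The weight update in the loop can be read as a gradient descent step on the half squared loss $\frac{1}{2}\sum_n (y_n - l_{1,n})^2$ with a learning rate of 1. That reading is an interpretation added here, not something the cell states. A minimal sketch that checks it numerically, assuming the sigmoid slope is `l1*(1-l1)`; `half_sq_loss` is a hypothetical helper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def half_sq_loss(w, X, y):
    # 0.5 * sum of squared errors over all examples
    return 0.5 * np.sum((y - sigmoid(X.dot(w))) ** 2)

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0, 0, 1, 1]], dtype=float).T

np.random.seed(1)
w = 2.0 * np.random.random((3, 1)) - 1.0

# Update direction used in the training loop
l1 = sigmoid(X.dot(w))
update = X.T.dot((y - l1) * l1 * (1.0 - l1))

# Central finite-difference estimate of the gradient of the loss
eps = 1e-6
num_grad = np.zeros_like(w)
for k in range(w.shape[0]):
    step = np.zeros_like(w)
    step[k, 0] = eps
    num_grad[k, 0] = (half_sq_loss(w + step, X, y)
                      - half_sq_loss(w - step, X, y)) / (2 * eps)

# If the update is minus the gradient, this gap is close to zero
max_gap = np.max(np.abs(update + num_grad))
print("max |update + gradient| = {}".format(max_gap))
```

A gap near zero means each pass through the loop moves the weights in the direction that decreases the squared error.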
