In [last week's blog](http://qingkaikong.blogspot.com/2016/11/machine-learning-3-artificial-neural.html), we talked about the basics of Artificial Neural Networks (ANN) and gave an intuitive summary and example. Hopefully you already have a sense of how an ANN works. This week, we will continue to explore ANNs and implement a simple version so that you get a better understanding.  

Key concepts you will get from this blog: Inputs, Weights, Outputs, Targets, Activation Function, Error, Bias input, Learning rate.   

Let's start with a simple type of Artificial Neural Network, the [Perceptron](https://en.wikipedia.org/wiki/Perceptron), which was developed in the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced. You can also treat it as a two-layer neural network without any hidden layers. Even though it has limitations, it contains most of the parts we want to cover for ANNs.   

Let's look at an example with 4 data samples as the input. Each sample has 3 features. The target is simply 0 or 1 for two different classes, i.e. class 0 and class 1. We want to build and train a simple Perceptron model that outputs the correct target when we feed in the 3 features of a data sample. See the following table. This example is modified from this [great post](http://iamtrask.github.io/2015/07/12/basic-python-network/).    

|Feature1|Feature2|Feature3|Target|
|:------:|:------:|:------:|:----:|
|    0   |    0   |    1   |   0  |
|    1   |    1   |    1   |   1  |
|    1   |    0   |    1   |   1  |
|    0   |    1   |    1   |   0  |

Let's have a look at the structure of our model.

<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure1_perceptron_structure.jpg" width="600"/>  

We can see from the figure that we have two layers in this model: the input layer and the output layer. The input layer has 3 features that connect to the output layer via 3 weights. The steps we will take to train this model are:

1. Initialize the weights to small random numbers, both positive and negative.
2. For each training sample:
    1. Calculate the output value.
    2. Update the weights based on the error.

Before we look at the code that implements this, we need to go through some concepts:  

### Bias term
We also see one extra node in the input layer with 1 in it, connected to the output layer by the weight $\omega_0$. Why do we add 1 to the input? Consider the case where all our features are 0: no matter what weights we have, the weighted sum passed to the output layer will always be 0. Adding this extra node with a constant 1 avoids the problem. It is usually called the bias term.   
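To make this concrete, here is a minimal sketch of the all-zero case. The weight values are made up purely for illustration; they are not taken from the model trained later in this post.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration
w_bias = 0.5                          # omega_0, the weight on the constant input 1
w = np.array([0.3, 0.2, 0.8])         # omega_1 .. omega_3

features = np.array([0.0, 0.0, 0.0])  # a sample whose features are all 0

z_without_bias = np.dot(features, w)
z_with_bias = 1 * w_bias + np.dot(features, w)

print(z_without_bias)  # 0.0, whatever the weights are
print(z_with_bias)     # 0.5, i.e. just omega_0
```

Without the constant input the weighted sum is stuck at 0 for this sample, while with it the network can still shift the sum through $\omega_0$.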

### Activation function   
The output layer has only one node, which sums all the inputs passed to it and determines whether the output should fire or not. Usually we use 1 to indicate firing and 0 to indicate not firing. In this case, the sum of all the inputs weighted by the weights is $z = 1\cdot\omega_0 + \text{feature}_1\cdot\omega_1 + \text{feature}_2\cdot\omega_2 + \text{feature}_3\cdot\omega_3$. Since this number can be anything, how can we scale it to a value between 0 and 1, like a probability? Here, the [activation function](https://en.wikipedia.org/wiki/Activation_function) comes to help. The purpose of the activation function here is simply to scale the output to a value between 0 and 1. You can choose from many different activation functions, but a very commonly used one is the sigmoid (also called logistic) function, which is shown below. One advantage of this function is that it is differentiable and its derivative is simple to compute. Since the derivative will be used to update the weights during learning, this property of the sigmoid function is very desirable.

In the figure, the horizontal axis is the z value from above; it can be any value from negative to positive. The vertical axis is the scaled value from 0 to 1 that we can use as our output. We can see that if the z value is a large positive number, say 10, or a large negative number, say -10, the scaled value will be very close to 1 or 0, respectively. This corresponds to a result that the network has determined very clearly. But if z is relatively close to 0, for example between -1 and 1, then the scaled value will be around 0.27 to 0.73, which indicates the network is not so confident about the result. You can use a threshold to decide whether the class is 1 or 0. For example, if the threshold is 0.5, then 0.27 will be classified as class 0, and 0.73 will be classified as class 1. 

<img src="https://raw.githubusercontent.com/qingkaikong/blog/master/39_ANN_part2_step_by_step/figures/figure2_sigmoid.jpg" width="600"/>   
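The scaling and thresholding described above can be sketched in a few lines. The sample z values and the 0.5 threshold are just the illustrative choices from the text, not a required setting.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
scaled = sigmoid(z)
print(np.round(scaled, 3))

# Turn the scaled values into classes with a threshold of 0.5 (an assumed choice)
threshold = 0.5
classes = (scaled >= threshold).astype(int)
print(classes)
```

The scaled values are roughly 0.00005, 0.269, 0.5, 0.731 and 0.99995, so only the two extreme inputs give an output that is nearly 0 or 1.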

Therefore, by feeding the features into the network, we can get the perceptron to classify an object into one of two classes. But what if the result for a given input is wrong? This is entirely possible, since we initialize the weights as small random numbers. To solve this problem, the perceptron needs the ability to learn from the data. 

### How to learn from error  

Now let's consider one case: the true class is 1, but the output from the perceptron network is 0. The error is then 1 - 0 = 1. 
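One common way to use this error, and the one the perceptron code later in this post follows, is to change each weight by learning rate × error × input. Below is a minimal sketch for a single sample; the starting weights are hypothetical and chosen only so that the first prediction is wrong.

```python
import numpy as np

learning_rate = 0.8                     # same value as used later in this post
x = np.array([1.0, 1.0, 0.0, 1.0])      # bias input 1, then features 1, 0, 1
w = np.array([-0.5, 0.2, 0.1, -0.3])    # hypothetical starting weights

target = 1
z = np.dot(x, w)                        # weighted sum
output = 1 if z >= 0 else 0             # step activation with threshold 0
error = target - output

# Perceptron update: move each weight in proportion to its input
w_new = w + learning_rate * error * x
z_new = np.dot(x, w_new)
print(output, error, z, z_new)
```

Here the weighted sum moves from about -0.6 to about 1.8, so the same sample would now be classified as 1. The weight attached to the zero-valued feature is left unchanged, because that input did not contribute to the mistake.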

### Learning rate   


In [84]:
import numpy as np

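# Sigmoid activation function (not used in this cell, which uses the step function below)
def sigmoid(x, deriv=False):
    if deriv:
        # assumes x is already a sigmoid output
        return x*(1-x)
    return 1/(1+np.exp(-x))

# Step activation function: 0 below the threshold, 1 at or above it
def activation_step_func(x, threshold = 0):
    return np.piecewise(x, [x < threshold, x >= threshold], [0, 1])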

# define learning rate
learning_rate = 0.8

# input dataset, note we add bias 1 to the input data
X = np.array([[0,0,1],[1,1,1],[1,0,1],[0,1,1]])
X = np.concatenate((np.ones((len(X), 1)), X), axis = 1)

# output dataset           
y = np.array([[0,1,1,0]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
# there are 4 weights here, and the first 
# one is related with the bias term
weights_0 = 2*np.random.random((4,1)) - 1

# train the network
for iter in range(100000):
    
    # loop through all the sample data
    for i in range(len(X)):
        # forward propagation
        layer_0 = X[[i], :]

        layer_1_output = activation_step_func(np.dot(layer_0,weights_0))

        # how much difference? This will be the error of 
        # our estimation and the true value for this sample
        layer1_error = y[[i], :] - layer_1_output

        # scale the error by the learning rate and
        # by the inputs of this sample
        layer1_delta = learning_rate * layer1_error * layer_0.T

        # update weights after every sample
        weights_0 += layer1_delta
print("Output After Training:")
print(layer_1_output)

Output After Training:
[[ 0.]]
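The cell above prints only the output for the last sample visited. As a rough, self-contained sketch, a perceptron on the same data can be trained and then checked against all four samples at once. The helper name `step`, the early stop and the 100-epoch cap are my own assumptions, not part of the original cell.

```python
import numpy as np

def step(z, threshold=0.0):
    """Step activation: 1 where z >= threshold, else 0."""
    return (z >= threshold).astype(float)

X = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
X = np.concatenate((np.ones((len(X), 1)), X), axis=1)   # prepend the bias input
y = np.array([[0, 1, 1, 0]], dtype=float).T

np.random.seed(1)
weights = 2 * np.random.random((4, 1)) - 1
learning_rate = 0.8

for epoch in range(100):                 # assumed cap on the number of passes
    n_errors = 0
    for i in range(len(X)):
        x_i = X[[i], :]                  # one sample, shape (1, 4)
        error = y[[i], :] - step(np.dot(x_i, weights))   # shape (1, 1)
        weights += learning_rate * error * x_i.T
        n_errors += int(error[0, 0] != 0)
    if n_errors == 0:                    # a full pass with no mistakes
        break

predictions = step(np.dot(X, weights))
print(predictions.ravel())
print(y.ravel())
```

Because the two classes in this dataset are linearly separable, the perceptron rule stops making mistakes after a small number of passes, and the predictions match the targets for every sample.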


In [91]:
import numpy as np

# The activation function, we will use the sigmoid
def sigmoid(x,deriv=False):
    if deriv:
        return x*(1-x)  # assumes x is already a sigmoid output
    return 1/(1+np.exp(-x))

# define learning rate
learning_rate = 0.8

# input dataset, note we add bias 1 to the input data
X = np.array([[0,0,1],[1,1,1],[1,0,1],[0,1,1]])
X = np.concatenate((np.ones((len(X), 1)), X), axis = 1)

# output dataset           
y = np.array([[0,1,1,0]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
weights_0 = 2*np.random.random((4,1)) - 1

# train the network with 50000 iterations
for iter in range(50000):

    # forward propagation
    layer_0 = X

    layer_1_output = sigmoid(np.dot(layer_0,weights_0))

    # how much difference? This will be the error of 
    # our estimation and the true value
    layer1_error = y - layer_1_output

    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    layer1_delta = learning_rate * layer1_error * sigmoid(layer_1_output,True)

    # update weights
    weights_0 += np.dot(layer_0.T,layer1_delta)
print("Output After Training:")
print(layer_1_output)

Output After Training:
[[ 0.00420132]
 [ 0.99624339]
 [ 0.99664366]
 [ 0.00375385]]
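The printed values are probabilities rather than class labels. As a small sketch, assuming the 0.5 threshold discussed in the activation function section, they can be converted to classes and compared with the targets:

```python
import numpy as np

# Outputs copied from the training run above
outputs = np.array([[0.00420132], [0.99624339], [0.99664366], [0.00375385]])
targets = np.array([[0, 1, 1, 0]]).T

threshold = 0.5                                  # assumed decision threshold
predicted = (outputs >= threshold).astype(int)

print(predicted.ravel())                         # [0 1 1 0]
print(np.array_equal(predicted, targets))        # True
```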


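## Explanation line by line  

The line numbers below refer to the second code cell above (the one that uses the sigmoid function).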

**Line 1:** This imports the NumPy module, a numerical computing library that provides arrays and linear algebra routines.   

**Line 4:** This block defines the activation function, which converts any number to a value between 0 and 1 that we can treat as a probability, as we discussed above.   
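The `deriv=True` branch returns `x*(1-x)`. This works because the derivative of the sigmoid can be written in terms of its own output, so the function expects `x` to already be a sigmoid output (as it is when called with `layer_1_output`). A short derivation, writing $\sigma$ for the sigmoid:

$$\sigma(z) = \frac{1}{1+e^{-z}}, \qquad \frac{d\sigma}{dz} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)\,\bigl(1-\sigma(z)\bigr)$$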

**Line 10:** Here we define our learning rate, which controls how fast the network learns from the data. Usually the learning rate is a number between 0 and 1, where 0 means the network does not learn at all and 1 means the weight update is applied at full size. 
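A quick way to see the effect is to compute the weight change for one sample under a few learning rates. This sketch assumes the simple perceptron-style update (learning rate × error × input) rather than the sigmoid version, and the sample and error are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 1.0, 0.0, 1.0])   # bias input followed by 3 features
error = 1.0                           # target minus output for one sample

deltas = {}
for learning_rate in (0.0, 0.1, 0.8):
    # change that would be applied to the weights
    deltas[learning_rate] = learning_rate * error * x
    print(learning_rate, deltas[learning_rate])
```

With a learning rate of 0 the weights never change, and the size of each step grows in direct proportion to the learning rate.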

**Line 13:** This initializes the input dataset as a NumPy matrix. Each row is a single data sample, and each column corresponds to one feature (one of the input nodes). We also add the bias term 1 in line 14. You can see that we now have 4 input nodes and 4 training examples.   

**Line 17:** This initializes our output dataset. `.T` is the transpose attribute, which converts our output data to a column vector. You can see that we have 4 rows and 1 column, corresponding to 4 data samples and 1 output node.  
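The shapes described for lines 13 to 17 can be checked directly with a short sketch that repeats only the data setup from the cell above:

```python
import numpy as np

X = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
X = np.concatenate((np.ones((len(X), 1)), X), axis=1)   # add the bias column
y = np.array([[0, 1, 1, 0]]).T

print(X.shape)   # (4, 4): 4 samples, 1 bias input + 3 features
print(y.shape)   # (4, 1): 4 samples, 1 output node
print(X[:, 0])   # the bias column, all ones
```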

**Line 21:** Before we generate the random weights, we set a seed to make sure that the same set of random numbers is generated every time. This is very useful when we test the algorithm and compare results with others: your results and my results should be the same. In practice, when you use the algorithm, you don't need to seed it.  

**Line 24:** This initializes the weights that connect the input layer to the output layer. Since we have 4 input nodes (including the bias term), we initialize the random weights with dimension (4,1). Also note that the random numbers are between -1 and 1, with a mean of 0. There is quite a bit of theory that goes into weight initialization. For now, just take it as a good practice to have a mean of zero in weight initialization.     
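Both points, the reproducibility from seeding and the range of the initial weights, can be checked with a short sketch. The variable names `w_first` and `w_second` are only for illustration.

```python
import numpy as np

np.random.seed(1)
w_first = 2 * np.random.random((4, 1)) - 1

np.random.seed(1)                      # re-seeding restarts the same sequence
w_second = 2 * np.random.random((4, 1)) - 1

print(np.array_equal(w_first, w_second))               # True
print(w_first.shape)                                   # (4, 1)
print(bool(np.all((w_first >= -1) & (w_first < 1))))   # True
```

`np.random.random` draws from the interval [0, 1), so multiplying by 2 and subtracting 1 maps the values to [-1, 1), centred on 0.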




## References  

[A Neural Network in 11 lines of Python](http://iamtrask.github.io/2015/07/12/basic-python-network/)  
[Machine learning - An Algorithmic Perspective](https://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html)  
[Single-Layer Neural Networks and Gradient Descent](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html)  
[Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/)  

In [None]:
<img src="http://sebastianraschka.com/images/blog/2015/singlelayer_neural_networks_files/perceptron_activation.png" width="600"/> 