# Lab 05 - Perceptron

### Objective 

The objective of this lesson is to provide an understanding of the Perceptron algorithm, its working principle, and how to implement it from scratch as well as using scikit-learn. By the end of this lesson, you will be able to design and implement a Perceptron model to classify linearly separable data, and evaluate the performance of the model using various metrics.

### Learning Outcomes

* Understand the working principle of the Perceptron algorithm
* Know how to implement the Perceptron algorithm from scratch using Python
* Know how to use the scikit-learn library to apply the Perceptron algorithm
* Understand how to train and evaluate a Perceptron model using various performance metrics


## Introduction to Artificial Neural Networks


Artificial neural networks are inspired by the biological neurons in the human body, which activate under certain conditions and trigger a corresponding response.

An artificial neural network (ANN) is a computational model that mimics the way nerve cells work in the human brain.

# Basic Structure of ANNs

The human brain is composed of roughly 86 billion nerve cells called neurons. Each is connected to thousands of other cells by axons. Stimuli from the external environment or inputs from sensory organs are received by dendrites. These inputs create electrical impulses, which travel quickly through the neural network. A neuron can then either pass the message on to other neurons or not forward it.

![image.png](attachment:image.png)

ANNs are composed of multiple nodes, which imitate the biological neurons of the human brain. The nodes are connected by links and interact with each other. Each node takes input data and performs simple operations on it. The result of these operations is passed to other nodes. The output at each node is called its activation or node value.

You can consider an artificial neuron as a mathematical model inspired by a biological neuron.

![image.png](attachment:image.png)

* A biological neuron receives its input signals from other neurons through dendrites (small fibers). Likewise, a perceptron receives numerical data from other perceptrons (or from raw input data) through its input nodes.


* The connection points between dendrites and biological neurons are called synapses. Likewise, the connections between inputs and perceptrons are called weights. They measure the importance level of each input.


* In a biological neuron, the nucleus produces an output signal based on the signals provided by dendrites. Likewise, the nucleus (colored in blue) in a perceptron performs some calculations based on the input values and produces an output.


* In a biological neuron, the output signal is carried away by the axon. Likewise, the axon in a perceptron is the output value which will be the input for the next perceptrons.

# Perceptron


The Perceptron algorithm is a two-class (binary) classification machine learning algorithm. It is a type of neural network model, perhaps the simplest type of neural network model. It consists of a single node or neuron that takes a row of data as input and predicts a class label.

# The structure of a perceptron

The following image shows a detailed structure of a perceptron. In some contexts, the bias, __b__, is denoted by __w0__. In that notation the matching input, __x0__, always takes the value 1, so __w0*x0 = b*1 = b__.

![image.png](attachment:image.png)

A perceptron takes the inputs, __x1, x2, …, xn__, multiplies them by the weights, __w1, w2, …, wn__, and adds the bias term, __b__, to compute the linear function, __z__. An activation function, __f__, is then applied to __z__ to get the output, __y__.

When drawing a perceptron, we usually ignore the bias unit for our convenience and simplify the diagram as follows. But in calculations, we still consider the bias unit.

![image.png](attachment:image.png)

# Inside a perceptron


A perceptron usually consists of two mathematical functions.


### Perceptron's linear function

This is also called the linear component of the perceptron. It is denoted by z. Its output is the weighted sum of the inputs plus bias unit and can be calculated as follows.


![image.png](attachment:image.png)

* The x1, x2, …, xn are inputs that take numerical values. There can be several (finite) inputs for a single neuron. They can be raw input data or outputs of the other perceptrons.


* The w1, w2, …, wn are weights that take numerical values and control the level of importance of each input. The higher the value, the more important the input.


* w1.x1 + w2.x2 + … + wn.xn is called the weighted sum of inputs.


* The b is called the bias term or bias unit and also takes a numerical value. It is added to the weighted sum of inputs. The bias shifts the input of the activation function, so the perceptron can produce a non-zero z even when every input is zero. In other words, if all of the inputs x1, x2, …, xn are 0, z is equal to the value of the bias.


The weights and biases are called the parameters in a neural network model. The optimal values for those parameters are found during the learning (training) process of the neural network.
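The linear component described above can be sketched in a few lines of Python. The function name `linear_function` and the sample numbers are hypothetical and only serve to illustrate the formula.

```python
def linear_function(inputs, weights, bias):
    """Return z = w1*x1 + w2*x2 + ... + wn*xn + b."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Weighted sum of inputs, then add the bias term
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# When every input is 0, z reduces to the bias (hypothetical values)
z_zero = linear_function([0.0, 0.0], [0.4, -0.7], bias=1.5)
print(z_zero)  # 1.5
```

Keeping the bias as a separate argument mirrors the formula; storing it as `w0` with a constant input of 1 would be an equivalent design.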

### Perceptron’s non-linear (activation) function

This is also called the non-linear component of the perceptron. It is denoted by f. It is applied on z to get the output y based on the type of activation function we use.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Several different types of activation function can be used for f.

![image.png](attachment:image.png)

Consider the following binary step activation function, which is also known as the threshold activation function. The threshold can be set to any value; here we use 0.



![image.png](attachment:image.png)

We say that a neuron or perceptron fires (or activates) only when the value of z exceeds the threshold value, 0. In other words, a neuron outputs 1 (fires or activates) if the value of z exceeds the threshold value, 0. Otherwise, it outputs 0.
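As a minimal sketch of the firing rule just described, the step function below outputs 1 only when z strictly exceeds the threshold. The name `binary_step` is hypothetical. Note that the `predict()` function later in this lab uses `>=` instead, so there an input of exactly 0 also produces 1.

```python
def binary_step(z, threshold=0.0):
    """Return 1 if z exceeds the threshold (the neuron fires), else 0."""
    return 1 if z > threshold else 0


# A negative value, the threshold itself, and a positive value
outputs = [binary_step(z) for z in (-2.0, 0.0, 0.5)]
print(outputs)  # [0, 0, 1]
```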

## Calculations inside a perceptron

Let’s perform a simple calculation inside a perceptron. Imagine that we have 3 inputs with the following values.

__x1=2,   x2=3   and   x3=1__

Because we have 3 inputs, we also have 3 weights that control the level of importance of each input. Assume the following values for the weights.


__w1=0.5, w2=0.2 and w3=10__

We also have the following value for the bias unit.

__b=2__


Let’s calculate the linear function, z.

z = (0.5 * 2 + 0.2 * 3 + 10 * 1) + 2

z = 13.6

The activation function takes the output of z (13.6) as its input and calculates the output y based on the type of activation function we use. For now, we use the sigmoid activation function defined below.


![image.png](attachment:image.png)

y = sigmoid(13.6)


y ≈ 0.999999


y ~ 1
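The worked example can be checked with a short script. This sketch assumes the sigmoid in the figure is the standard logistic function, 1 / (1 + e^(-z)); the variable names are illustrative.

```python
import math


def sigmoid(z):
    """Standard logistic function: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


inputs = [2, 3, 1]        # x1, x2, x3
weights = [0.5, 0.2, 10]  # w1, w2, w3
bias = 2                  # b

# Linear function: weighted sum of the inputs plus the bias
z = sum(w * x for w, x in zip(weights, inputs)) + bias
# Non-linear function: activation applied to z
y = sigmoid(z)

print(round(z, 6))  # 13.6
print(round(y, 6))  # 0.999999
```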



The entire calculation process can be denoted in the following diagram. For ease of understanding, we also denote the bias term in a separate node.

![image.png](attachment:image.png)

## Stochastic Gradient Descent

Gradient Descent is the process of minimizing a function by following the gradients of the cost function.


This involves knowing the form of the cost as well as the derivative so that from a given point you know the gradient and can move in that direction, e.g. downhill towards the minimum value.


In machine learning, stochastic gradient descent is a variant that evaluates the model and updates the weights after every training instance in order to minimize the error of the model on the training data.


__The way this optimization algorithm works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction.__


This procedure can be used to find the set of weights in a model that result in the smallest error for the model on the training data.


For the Perceptron algorithm, the weights (w) are updated at each iteration using the equation:


__w = w + learning_rate * (expected - predicted) * x__

Here w is the weight being optimized, learning_rate is a learning rate that you must configure (e.g. 0.01), (expected – predicted) is the prediction error of the model on the current training instance, and x is the input value associated with that weight.
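To make the update rule concrete, here is a small sketch that applies it to a single weight. The helper name `update_weight` and all numbers are hypothetical.

```python
def update_weight(w, learning_rate, expected, predicted, x):
    """Apply w = w + learning_rate * (expected - predicted) * x."""
    return w + learning_rate * (expected - predicted) * x


# The model predicted 0 but the label is 1, so the weight is nudged upwards
w_new = update_weight(w=0.5, learning_rate=0.01, expected=1, predicted=0, x=2.0)
print(w_new)  # 0.52

# A correct prediction gives an error of 0, so the weight is unchanged
w_same = update_weight(w=0.5, learning_rate=0.01, expected=1, predicted=1, x=2.0)
print(w_same)  # 0.5
```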

## Algorithm

1. Initialize the weights to small random values or to zero (the implementation later in this lab starts from zero).


2. For each input in the training dataset, calculate the dot product of the input features and the weights, and pass the result through an activation function to get the output prediction.


3. Calculate the prediction error as the true output minus the predicted output.


4. Update the weights using the prediction error and a learning rate, according to the formula:
    __new_weight = old_weight + (learning_rate * error * input_feature)__
    
    
5. Repeat steps 2-4 for all inputs in the training dataset, for a fixed number of epochs or until the weights converge to a stable solution.


6. To classify new input data, calculate the dot product of the input features and the learned weights, and pass the result through the activation function to get the final output prediction.
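The six steps above can be sketched end to end on a tiny, linearly separable problem. This is an illustrative sketch rather than the implementation built later in the lab: the names `fit_perceptron` and `classify`, the logical AND data, the random initialization range, and the early stop when an epoch has no mistakes are all assumptions made for the example.

```python
import random


def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0.0 else 0


def fit_perceptron(samples, labels, learning_rate=0.1, n_epochs=100, seed=0):
    rng = random.Random(seed)
    # Step 1: small random initial weights; index 0 holds the bias
    weights = [rng.uniform(-0.05, 0.05) for _ in range(len(samples[0]) + 1)]
    for _ in range(n_epochs):                     # Step 5: repeat over epochs
        n_mistakes = 0
        for x, target in zip(samples, labels):
            # Step 2: dot product plus bias, passed through the activation
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            error = target - step(z)              # Step 3: true minus predicted
            if error != 0:
                n_mistakes += 1
                # Step 4: update the bias and every weight
                weights[0] += learning_rate * error
                for i, xi in enumerate(x):
                    weights[i + 1] += learning_rate * error * xi
        if n_mistakes == 0:                       # weights have stabilized
            break
    return weights


def classify(x, weights):
    # Step 6: same computation as step 2, using the learned weights
    return step(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


and_inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_labels = [0, 0, 0, 1]
learned = fit_perceptron(and_inputs, and_labels)
and_predictions = [classify(x, learned) for x in and_inputs]
print(and_predictions)  # [0, 0, 0, 1]
```

Because logical AND is linearly separable, the perceptron is guaranteed to reach a set of weights that classifies every row correctly.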

## Making Prediction


The first step is to develop a function that can make predictions.



This will be needed both in the evaluation of candidate weights values in stochastic gradient descent, and after the model is finalized and we wish to start making predictions on test data or new data.


Below is a function named predict() that predicts an output value for a row given a set of weights.


The first weight is always the bias as it is standalone and not responsible for a specific input value.

### Make a prediction with weights

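```python
def predict(row, weights):
    # weights[0] is the bias; the last element of row is the class label
    activation = weights[0]
    for i in range(len(row)-1):
        activation += weights[i + 1] * row[i]
    # Binary step activation with a threshold of 0
    return 1.0 if activation >= 0.0 else 0.0
```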

The predict function takes in a row of input data and a set of weights for a perceptron, and returns a prediction for that input data based on the current weight values.

We can contrive a small dataset to test our prediction function.

<table>
  <tr>
    <th>X1</th>
    <th>X2</th>
    <th>Y</th>
  </tr>
  <tr>
    <td>2.7810836</td>
    <td>2.550537003</td>
    <td>0</td>
  </tr>
  <tr>
    <td>1.465489372</td>
    <td>2.362125076</td>
    <td>0</td>
  </tr>
  <tr>
    <td>3.396561688</td>
    <td>4.400293529</td>
    <td>0</td>
  </tr>
  <tr>
    <td>1.38807019</td>
    <td>1.850220317</td>
    <td>0</td>
  </tr>
  <tr>
    <td>3.06407232</td>
    <td>3.005305973</td>
    <td>0</td>
  </tr>
  <tr>
    <td>7.627531214</td>
    <td>2.759262235</td>
    <td>1</td>
  </tr>
  <tr>
    <td>5.332441248</td>
    <td>2.088626775</td>
    <td>1</td>
  </tr>
  <tr>
    <td>6.922596716</td>
    <td>1.77106367</td>
    <td>1</td>
  </tr>
  <tr>
    <td>8.675418651</td>
    <td>-0.242068655</td>
    <td>1</td>
  </tr>
  <tr>
    <td>7.673756466</td>
    <td>3.508563011</td>
    <td>1</td>
  </tr>
</table>

```python
# test predictions
dataset = [[2.7810836,2.550537003,0],
 [1.465489372,2.362125076,0],
 [3.396561688,4.400293529,0],
 [1.38807019,1.850220317,0],
 [3.06407232,3.005305973,0],
 [7.627531214,2.759262235,1],
 [5.332441248,2.088626775,1],
 [6.922596716,1.77106367,1],
 [8.675418651,-0.242068655,1],
 [7.673756466,3.508563011,1]]
```

```python
weights = [-0.1, 0.20653640140000007, -0.23418117710000003]
for row in dataset:
    prediction = predict(row, weights)
    print("Expected=%d, Predicted=%d" % (row[-1], prediction))
```

```
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=1, Predicted=1
Expected=1, Predicted=1
Expected=1, Predicted=1
Expected=1, Predicted=1
Expected=1, Predicted=1
```


![image.png](attachment:image.png)

Now we are ready to implement stochastic gradient descent to optimize our weight values.

### Training Network Weights

We can estimate the weight values for our training data using stochastic gradient descent.

Stochastic gradient descent requires two parameters:

* __Learning Rate:__ Used to limit the amount each weight is corrected each time it is updated. It is usually set to a small positive number, typically in the range of 0.0 to 1.0. Choosing the learning rate matters: a value that is too small leads to slow convergence, and a value that is too high may cause the algorithm to overshoot the optimal solution. The best learning rate depends on the problem and can be found through experimentation and tuning.



* __Epochs:__ The number of times to run through the training data while updating the weights.

These two parameters, along with the training data, will be the arguments to the function.


There are 3 loops we need to perform in the function:
* Loop over each epoch.
* Loop over each row in the training data for an epoch.
* Loop over each weight and update it for a row in an epoch.

As you can see, we update each weight for each row in the training data, each epoch.

Weights are updated based on the error the model made. The error is calculated as the difference between the expected output value and the prediction made with the candidate weights. There is one weight for each input attribute, and these are updated in a consistent way, for example: 

__w(t+1) = w(t) + learning_rate * (expected(t) - predicted(t)) * x(t)__


This formula represents how the weights in a linear model are updated during training using the perceptron learning algorithm.

* w(t+1) represents the new weight value for a particular feature (x) at time t+1
* w(t) represents the current weight value for that feature at time t
* learning_rate is a hyperparameter that determines the step size of each update
* (expected(t) - predicted(t)) is the error, or the difference between the true label (expected) and the predicted label using the current weights (predicted) for a particular example at time t
* x(t) is the value of the input feature for that example at time t

The bias is updated in a similar way, except without an input as it is not associated with a specific input value: 

__bias(t+1) = bias(t) + learning_rate * (expected(t) - predicted(t))__
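Both update rules can be traced by hand on the first row of the contrived dataset. In this sketch, `update_step` is a hypothetical helper; it starts from all-zero weights, for which the activation is 0 and the prediction is therefore 1.

```python
def update_step(weights, row, predicted, learning_rate):
    """Return new weights after one update; weights[0] is the bias."""
    error = row[-1] - predicted  # expected - predicted
    # Bias update: no input value is involved
    new_weights = [weights[0] + learning_rate * error]
    # Weight updates: each one is scaled by its own input value
    for w, x in zip(weights[1:], row[:-1]):
        new_weights.append(w + learning_rate * error * x)
    return new_weights


first_row = [2.7810836, 2.550537003, 0]  # X1, X2, Y
updated = update_step([0.0, 0.0, 0.0], first_row, predicted=1, learning_rate=0.1)
print([round(w, 4) for w in updated])  # [-0.1, -0.2781, -0.2551]
```

The prediction was too high (1 instead of 0), so the bias and both weights are pushed in the negative direction.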

Now we can put all of this together. Below is a function named __train_weights()__ that calculates weight values for a training dataset using stochastic gradient descent.

### Estimate Perceptron weights using stochastic gradient descent
 

```python
def train_weights(train, l_rate, n_epoch):
    # One weight per input feature, plus the bias at index 0
    weights = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:
            prediction = predict(row, weights)
            error = row[-1] - prediction
            sum_error += error**2
            # bias(t+1) = bias(t) + learning_rate * (expected(t) - predicted(t))
            weights[0] = weights[0] + l_rate * error
            for i in range(len(row)-1):
                # w(t+1) = w(t) + learning_rate * (expected(t) - predicted(t)) * x(t)
                weights[i + 1] = weights[i + 1] + l_rate * error * row[i]
        print('epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return weights
```


```python
l_rate = 0.1
n_epoch = 5
weights = train_weights(dataset, l_rate, n_epoch)
print(weights)
```

```
epoch=0, lrate=0.100, error=2.000
epoch=1, lrate=0.100, error=1.000
epoch=2, lrate=0.100, error=0.000
epoch=3, lrate=0.100, error=0.000
epoch=4, lrate=0.100, error=0.000
[-0.1, 0.20653640140000007, -0.23418117710000003]
```


#### Code Explanation

The train_weights() function trains the weights of a perceptron model on a given training dataset. Here is a step-by-step explanation of the code:

1. __The function takes three arguments:__ 
           train, which is the training dataset, 
           l_rate, which is the learning rate, and 
           n_epoch, which is the number of epochs (iterations) to train for.


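2. A list of initial weights is created with the same length as the number of columns in the training dataset: the first entry is the bias and the remaining entries correspond to the input features. All of them are set to 0.0 using a list comprehension.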


3. A loop is started over the range of n_epoch, which is the number of epochs to train for.


4. A variable sum_error is initialized to 0.0. This will be used to keep track of the total error made by the model in each epoch.


5. Another loop is started over each row in the training dataset.


6. The predict function is called with the current row of data and the current weights of the model to make a prediction.


7. The error is calculated as the difference between the expected output value (which is the last column of the row) and the predicted output value.


8. The error is squared and added to the sum_error variable.


9. The bias weight (which is the first weight in the weights list) is updated by adding the product of the learning rate and the error.


10. Another loop is started over each input feature of the row.


11. The weight for the current input feature is updated by adding the product of the learning rate, the error, and the input feature value.


12. After all rows in the training dataset have been processed, the total error for the epoch is printed to the console.

13. After all epochs have been processed, the final weights of the model are returned.

As the output shows, the error drops to zero by the third epoch (epoch=2), so the algorithm learns this problem very quickly.
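One simple way to evaluate the trained model is classification accuracy, the percentage of rows predicted correctly. The sketch below is self-contained: it repeats `predict()` and the dataset from above and reuses the weights printed by the training run. The `accuracy` helper is a hypothetical addition, and measuring on the training data itself only confirms that the model fits that data.

```python
def predict(row, weights):
    # weights[0] is the bias; the last element of row is the class label
    activation = weights[0]
    for i in range(len(row) - 1):
        activation += weights[i + 1] * row[i]
    return 1.0 if activation >= 0.0 else 0.0


def accuracy(rows, weights):
    """Return the percentage of rows whose prediction matches the label."""
    correct = sum(1 for row in rows if predict(row, weights) == row[-1])
    return correct / len(rows) * 100.0


dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]

# Weights produced by the training run shown above
weights = [-0.1, 0.20653640140000007, -0.23418117710000003]
train_accuracy = accuracy(dataset, weights)
print(train_accuracy)  # 100.0
```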