#Chapter 3 - Perceptron#

##Definition of artificial neurons##

Artificial neurons in the context of binary classification make use of a decision function, which we will refer to as $𝛷(z)$.

This function takes as its input $z$, a linear combination of two vectors.

*Note, when we refer to a linear combination of vectors we are speaking specifically about the vector dot product*

There's the weight vector:

$w = [w_1 \dots w_m]$

and the input (neuron) values:

$x = [x_1 \dots x_m]$

If the net input $z$ of an example $x^{(i)}$ is greater than or equal to a certain **threshold value**, which we refer to as 𝜭 (theta), then we predict class $1$.

Otherwise we return class $-1$.

This means that the decision function $𝛷(z)$ is a variant of a unit step function.

In the case that:

$z \ge 𝜭$

we return class $+1$.

Else if:

$z \lt 𝜭$

we return class $-1$.
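
The thresholded decision described above can be sketched in a few lines of NumPy. The function name `unit_step` and the sample values are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def unit_step(z, theta):
    """Return +1 where the net input z reaches the threshold theta, else -1."""
    return np.where(z >= theta, 1, -1)

# Hypothetical net inputs and threshold, chosen only for illustration
z_values = np.array([-0.5, 0.2, 0.7])
print(unit_step(z_values, theta=0.2))
```

Only the two values that reach the threshold are mapped to class $+1$; everything below it is mapped to class $-1$.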

In order to fold the threshold into $z$ we introduce a zeroth weight and a zeroth input. By convention:

$w_0=-𝜭$

and

$x_0 = 1$


$z$ can now be expressed as:

$z=w_0x_0+w_1x_1+\dots+w_mx_m = w^Tx$

$z$ is therefore the **dot product** of **w** and **x**, which we can express as:

$z=w^Tx$

By moving 𝜭 into the **bias unit** $w_0 = -𝜭$ we are able to define $z$ in the manner shown above, and the decision rule becomes a comparison of $z$ with $0$ instead of with 𝜭.
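
As a rough check that the bias convention gives the same decision as the explicit threshold, the sketch below computes both forms. All numeric values are made up for illustration:

```python
import numpy as np

theta = 0.5                     # hypothetical threshold
w = np.array([0.4, -0.2])       # hypothetical weights w_1, w_2
x = np.array([1.0, 0.5])        # hypothetical inputs x_1, x_2

# Threshold form: compare the weighted sum against theta
fires_threshold = np.dot(w, x) >= theta

# Bias form: prepend w_0 = -theta and x_0 = 1, then compare z against 0
w_b = np.concatenate(([-theta], w))
x_b = np.concatenate(([1.0], x))
z = np.dot(w_b, x_b)
fires_bias = z >= 0

print(z, fires_threshold, fires_bias)
```

Both comparisons agree because subtracting 𝜭 from each side of the threshold inequality leaves the decision unchanged.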

The following diagram illustrates how the decision function works:

![picture](https://drive.google.com/uc?export=view&id=1h3rTJzYASCxLYtf8kX0KIEz8HbIdcOUB)

It is clear from the example that in order for the perceptron to work the two classes must be **linearly separable**. If not, the learning rule will never converge on a solution.

***

##The Perceptron Learning Rule##

These neurons are crude approximations of the "all or nothing" model neurons in the human CNS. They either fire (+1) or they don't (-1).

The learning algorithm can be expressed as follows:

1) Initialize the weights to zero or small random numbers.

2) For each training example $x^{(i)}$

  *a. Compute the output value $\hat{y}$.*

  *b. Update the weights.*

$w_j := w_j + Δw_j$

The update value $Δw_j$ is defined by the following formula:

$Δw_j = α(y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$

where $α$ is our learning rate (a hyperparameter, typically a float between 0.0 and 1.0),

$y^{(i)}$ is the **true class label**, or actual observed data, and

${\hat{y}}^{(i)}$ is the **predicted class label**.

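In this case the inner formula:

$(y^{(i)}-{\hat{y}}^{(i)})$

represents the **difference** between the true class label and the predicted class label. It is zero for a correct prediction, so the weights only change when an example is misclassified.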
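
To make the update rule concrete, here is a small sketch that evaluates $Δw$ for the three possible outcomes. The helper `delta_w`, the learning rate, and the input vector are hypothetical values chosen for illustration:

```python
import numpy as np

alpha = 0.1                          # learning rate (assumed value)
x_i = np.array([1.0, 2.0, -1.0])     # x_0 = 1 (bias input), then x_1, x_2

def delta_w(y_true, y_pred, x, alpha):
    """Perceptron update: alpha * (y - y_hat) * x, one entry per weight."""
    return alpha * (y_true - y_pred) * x

print(delta_w(1, 1, x_i, alpha))     # correct prediction: all updates are zero
print(delta_w(1, -1, x_i, alpha))    # true +1, predicted -1: update is +0.2 * x
print(delta_w(-1, 1, x_i, alpha))    # true -1, predicted +1: update is -0.2 * x
```

With labels in $\{-1, +1\}$ the difference term can only be $0$, $+2$, or $-2$, so each misclassification shifts the weights by $2α$ times the input, in the direction of the true class.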

In the case of a two-dimensional dataset, where $output^{(i)}$ denotes the predicted class label $\hat{y}^{(i)}$, we would write the update function as this series:

***

$Δw_0 = α(y^{(i)}-output^{(i)})$

$Δw_1 = α(y^{(i)}-output^{(i)})x_1^{(i)}$

$Δw_2 = α(y^{(i)}-output^{(i)})x_2^{(i)}$

***

*More generally, with $m$ features the series runs from $Δw_0$ to $Δw_m = α(y^{(i)}-output^{(i)})x_m^{(i)}$. The update $Δw_0$ has no explicit input factor because $x_0 = 1$.*

Mathematically, it can be proven that the perceptron will always converge given two conditions: linearly separable classes and a sufficiently small $α$.

If the classes are not linearly separable, we must set a maximum number of passes over the training set (**epochs**) and/or a threshold for the number of tolerated misclassifications.

Without these stopping conditions the perceptron would never stop updating the weights.
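
The learning rule and the epoch limit can be combined into a short training loop. This is a minimal sketch, not the scikit-learn implementation: the function `train_perceptron`, its defaults, and the toy dataset are all assumptions made for illustration.

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, n_epochs=10):
    """Train a perceptron; y must contain the labels -1 and +1."""
    w = np.zeros(X.shape[1] + 1)            # w[0] plays the role of the bias unit w_0
    errors_per_epoch = []
    for _ in range(n_epochs):               # hard cap on the number of epochs
        errors = 0
        for x_i, y_i in zip(X, y):
            z = w[0] + np.dot(w[1:], x_i)   # net input
            y_hat = 1 if z >= 0.0 else -1   # unit step decision
            update = alpha * (y_i - y_hat)  # zero when the prediction is correct
            w[0] += update                  # bias update, since x_0 = 1
            w[1:] += update * x_i
            errors += int(update != 0.0)
        errors_per_epoch.append(errors)
        if errors == 0:                     # a full pass with no mistakes: stop early
            break
    return w, errors_per_epoch

# Tiny linearly separable toy dataset (made up for this sketch)
X_toy = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])
w, errors = train_perceptron(X_toy, y_toy)
print(w, errors)
```

On separable data like this toy set the loop stops as soon as an epoch produces no misclassifications. On non-separable data only the `n_epochs` cap ends training.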

***

![picture](https://drive.google.com/uc?export=view&id=1tIlgaqk-Qh88r2DwwASAbmNxmqcLAnQZ)

***

The function of the perceptron flow can be summarized as follows:

***

![picture](https://drive.google.com/uc?export=view&id=1cVN1fLFJ_JqXkGBUkl_bRCms-Q93SKH4)

***

The net input function computes $z$, in this case the scalar resulting from $w^Tx$.

The threshold function is the familiar $𝛷(z)$, the unit step function.

In [None]:
from sklearn import datasets
from numpy.ma.core import shape
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
iris = datasets.load_iris()
print(type(iris))
# In this dataset, the third column (index 2) is petal length and the fourth column (index 3) is petal width
# In the iris.target array, 0=Setosa, 1=Versicolor, 2=Virginica

X = iris.data[:, [2,3]]
y = iris.target

print(shape(X))

"""
print("The shape of y is: ", shape(y))
print("The shape of X is: ", shape(X))
"""

# We now break the data into 70% training and 30% eval.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Standardize the data features (from sklearn)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Verify the data is a compatible shape

"""
print(shape(X_train_std))
print(shape(X_test_std))
print(shape(y_train))
"""

# Train the Perceptron
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)

# Test the Perceptron on the held-out test set
y_pred = ppn.predict(X_test_std)
print(shape(y_pred))
print(shape(y_test))
#accuracy_score(y_test, y_pred)



<class 'sklearn.utils.Bunch'>
(150, 2)
(45,)
(45,)


In [None]:
# Load the iris dataset
iris = datasets.load_iris()

# Create our X and y data
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

sc = StandardScaler()
sc.fit(X_train)

# Apply the scaler to the X training data
X_train_std = sc.transform(X_train)

# Apply the SAME scaler to the X test data
X_test_std = sc.transform(X_test)

# Create a perceptron object with the parameters: up to 300000 iterations (epochs) over the data, and a learning rate of 0.000001
ppn = Perceptron(max_iter=300000, eta0=0.000001, random_state=0)

# Train the perceptron
ppn.fit(X_train_std, y_train)

# Apply the trained perceptron to the X test data to make predictions for the y test data
y_pred = ppn.predict(X_test_std)

# View the accuracy of the model, which is: 1 - (observations predicted wrong / total observations)
print(shape(y_pred))
print(shape(y_test))
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

(45,)
(45,)
Accuracy: 0.84
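
The accuracy formula in the comment above, 1 - (observations predicted wrong / total observations), can be checked by hand. The sketch below uses NumPy only, and the label arrays are made up; they are not the iris predictions from the run above:

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_hat = np.array([0, 1, 2, 1, 1, 0, 2, 2])

n_wrong = int((y_true != y_hat).sum())      # number of misclassified observations
manual_accuracy = 1 - n_wrong / len(y_true)
fraction_correct = np.mean(y_true == y_hat)

print(n_wrong, manual_accuracy, fraction_correct)
```

Both expressions give the fraction of correctly classified observations, which is what `accuracy_score` reports with its default settings.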
