# Perceptron 

*This is based on The Elements of Statistical Learning, 2nd edition (ESL II), Section 4.5.1*

## Problem Formulation

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified
points to the decision boundary $w^T x + b = 0$.

Let $f(x) = w^T x + b$ denote the prediction, with labels $y_i \in \{-1, 1\}$. A point $x_i$ that lies exactly on the decision boundary has $f(x_i) = 0$ and would be classified as positive by a plain sign rule. To avoid counting such a point as correct, we treat $x_i$ as correctly classified only when $y_i f(x_i) > 0$, and as misclassified otherwise.

## Goal

The goal is to minimize the objective:

$D(w, b) = - \sum_{i\in M} y_i(w^T x_i + b)$

where $M$ is the set of all misclassified points.

Every misclassified point contributes a positive term to the cost:
1. $y_i =1, f(x_i) < 0 \implies -y_i f(x_i) > 0 \implies$ increases the cost
2. $y_i =-1, f(x_i) > 0 \implies -y_i f(x_i) > 0 \implies$ increases the cost

$D(w, b)$ is therefore non-negative, and it is proportional to the distance of the misclassified points to the decision boundary defined by $w^T x + b = 0$.
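To make the objective concrete, here is a minimal sketch that evaluates $D(w, b)$ on a small toy dataset. The function name `perceptron_objective` and the data values are assumptions chosen for illustration; they do not come from the reference text.

```python
import numpy as np

def perceptron_objective(w, b, X, y):
    """Return D(w, b) = -sum over misclassified i of y_i * (w^T x_i + b)."""
    margins = y * (X @ w + b)       # y_i * f(x_i) for every point
    misclassified = margins <= 0    # the set M (boundary points included)
    return -np.sum(margins[misclassified])

# Toy data, made up for illustration
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

D = perceptron_objective(w, b, X, y)
print(D)  # 1.0: only the last point has a negative margin (-1)
```

The second point lies on the boundary, so it belongs to $M$ but adds nothing to the cost; only points strictly on the wrong side increase $D(w, b)$.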

## Why Proportional

For any point $x$, let $x_0$ be its projection onto the hyperplane, so that $w^T x_0 + b = 0$.

Let $r$ be the signed distance from $x$ to the hyperplane, measured along the unit normal $\frac{w}{\|w\|}$.

$\implies w^T x_0 = -b, \quad x-x_0 = \frac{w}{\|w\|} r$

$\implies w^Tx - w^T x_0 = \frac{w^Tw}{\|w\|} r$

$\implies w^Tx + b = \| w\| r$

$\implies r = \frac{1}{\| w\|} f(x)$

Thus $r$, the signed distance to the decision boundary, is proportional to the prediction $f(x)$. For a misclassified point, $-y_i f(x_i) = \|w\| |r|$, so each term of $D(w, b)$ equals $\|w\|$ times that point's distance to the boundary.
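The relation $r = f(x) / \|w\|$ can be checked numerically. The sketch below uses made-up values for $w$, $b$ and $x$ (assumptions for illustration), recovers the projection $x_0$, and confirms that it lies on the hyperplane at distance $|r|$ from $x$.

```python
import numpy as np

# Hypothetical hyperplane and point, chosen so that ||w|| = 5
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])

w_norm = np.linalg.norm(w)
f_x = w @ x + b                 # prediction f(x) = 9 + 16 - 5 = 20
r = f_x / w_norm                # signed distance from the formula above

# Projection of x onto the hyperplane: step back r units along w / ||w||
x0 = x - r * w / w_norm

print(r)                          # signed distance
print(w @ x0 + b)                 # approximately 0: x0 lies on the hyperplane
print(np.linalg.norm(x - x0))     # Euclidean distance, matches |r|
```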

## Update rule

Taking the gradients of $D(w, b)$ with respect to $w$ and $b$, with the set $M$ held fixed, we have:

$\frac{\partial D(w, b)}{\partial w} = - \sum_{i\in M} x_i y_i$

$\frac{\partial D(w, b)}{\partial b} = - \sum_{i\in M} y_i$

Instead of taking the sum, the misclassified observations are visited in some sequence, and the parameters $w, b$ are updated via:

$\begin{pmatrix}
w^{n+1}\\
b^{n+1}
\end{pmatrix} =
\begin{pmatrix}
w^{n}\\
b^{n}
\end{pmatrix} + \alpha
\begin{pmatrix}
y_i x_i\\
y_i
\end{pmatrix}
$

where $\alpha$ here is the learning rate.
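A single update can be traced by hand. The sketch below applies the rule to one misclassified point; the helper name `perceptron_step` and the numbers are assumptions for illustration.

```python
import numpy as np

def perceptron_step(w, b, x_i, y_i, alpha=1.0):
    """Apply one update if (x_i, y_i) is misclassified; otherwise return w, b unchanged."""
    if y_i * (w @ x_i + b) <= 0:
        w = w + alpha * y_i * x_i
        b = b + alpha * y_i
    return w, b

w, b = np.array([1.0, 0.0]), 0.0
x_i, y_i = np.array([1.0, 1.0]), -1    # f(x_i) = 1 > 0, so this point is misclassified

margin_before = y_i * (w @ x_i + b)    # -1
w_new, b_new = perceptron_step(w, b, x_i, y_i)
margin_after = y_i * (w_new @ x_i + b_new)

print(w_new, b_new)                    # w moves to (0, -1), b to -1
print(margin_before, margin_after)     # the margin rises from -1 to 2
```

Each update raises $y_i f(x_i)$ for the visited point by $\alpha(\|x_i\|^2 + 1)$, here $3$, which is why repeated updates push a misclassified point toward the correct side.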

## Problems with Perceptron

1. When the data are separable, there are many solutions, and which one is found depends on the starting values.
2. The number of steps needed, although finite, can be very large. The smaller the gap between the two classes, the longer it takes to find a separating hyperplane.
3. When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
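The third point can be observed on a tiny example. The sketch below is a minimal illustration, not the notebook's implementation: `train_perceptron` and its `max_epochs` cap are assumptions introduced here so that the loop terminates. It is run on the AND pattern (separable) and the XOR pattern (not separable).

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_epochs=100):
    """Run the perceptron from zero weights; return (w, b, converged)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # misclassified or on the boundary
                w, b = w + alpha * y_i * x_i, b + alpha * y_i
                errors += 1
        if errors == 0:                    # a full clean pass: separating hyperplane found
            return w, b, True
    return w, b, False                     # gave up after max_epochs passes

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([-1, -1, -1, 1])   # AND pattern: linearly separable
y_xor = np.array([-1, 1, 1, -1])    # XOR pattern: not linearly separable

print(train_perceptron(X, y_and)[2])   # True
print(train_perceptron(X, y_xor)[2])   # False
```

No hyperplane classifies all four XOR points correctly, so every pass contains at least one update and the loop only ends because of the cap.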

In [26]:
import numpy as np

class Perceptron:
    """Perceptron classifier for labels in {-1, 1}."""

    def __init__(self, alpha=1, max_iter=1000):

        self.alpha = alpha        # learning rate
        self.max_iter = max_iter  # cap on passes over the data, so fit() also stops on non-separable data
        self.w = None
        self.b = None

    def fit(self, X, y):

        n, d = X.shape
        # Random starting values; the solution found depends on them
        self.w = np.random.randn(d)
        self.b = np.random.randn()

        for _ in range(self.max_iter):

            n_errors = 0

            for i in range(n):

                pred = np.dot(self.w, X[i]) + self.b

                # A point is misclassified (or on the boundary) when y_i * f(x_i) <= 0
                if y[i] * pred <= 0:
                    # Update w and b right away, one misclassified point at a time
                    self.w = self.w + self.alpha * y[i] * X[i]
                    self.b = self.b + self.alpha * y[i]
                    n_errors += 1

            # A full pass without errors means a separating hyperplane was found
            if n_errors == 0:
                break

        return self

    def decision_function(self, X):

        # f(x) = w^T x + b for every row of X
        return X @ self.w + self.b

    def predict(self, X):

        # Positive side of the hyperplane -> 1, otherwise -> -1
        return np.where(self.decision_function(X) > 0, 1, -1)



In [31]:
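from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Keep the first 100 samples: classes 0 (setosa) and 1 (versicolor), which are linearly separable
iris = load_iris()
x = iris['data'][:100]
y = iris['target'][:100]
# Relabel class 0 as -1 so that the labels are in {-1, 1}
y = np.where(y == 0, -1, y)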

In [32]:
clf = Perceptron().fit(x, y)

In [33]:
accuracy_score(y, clf.predict(x))

1.0