<h3>Theory behind the perceptron</h3>
<p>The perceptron learning algorithm was invented in the late 1950s by <a href="http://en.wikipedia.org/wiki/Frank_Rosenblatt">Frank Rosenblatt</a>. It belongs to the class of linear classifiers: for a data set whose points fall into two categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is an <em>(n-1)</em>-dimensional hyperplane in an <em>n</em>-dimensional space; in particular, it is a line in the plane and a plane in 3-dimensional space.</p>
<p>Our data set will be assumed to consist of <em>N</em> observations, each characterized by <em>d</em> features or attributes,

$$
\mathbf{x}_n = (x_{n,1}, \ldots, x_{n,d})
$$

for $n = 1, \ldots, N$. The problem of classifying these data points into two classes can be translated to that of finding a set of weights $w_i$ such that all vectors $\mathbf{x} = (x_1, \ldots, x_d)$ satisfying</p>
<p style="text-align:center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bi%3D1%7D%5Ed+w_i+x_i+%3C+b&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &lt; b" title="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &lt; b" class="latex" /></p>
<p>are assigned to one of the classes, whereas those satisfying</p>
<p style="text-align:center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bi%3D1%7D%5Ed+w_i+x_i+%3E+b&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &gt; b" title="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &gt; b" class="latex" /></p>
<p>are assigned to the other, for a given threshold value $b$. If we set $w_0 = -b$ and introduce an artificial coordinate $x_0 = 1$ in our vectors $\mathbf{x}_n$, both conditions reduce to checking the sign of a single sum, and we can write the perceptron separator formula as</p>
<p style="text-align:center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+h%28%5Cmathbf%7Bx%7D%29+%3D+%5Cmathrm%7Bsign%7D%5Cleft%28%5Csum_%7Bi%3D0%7D%5Ed+w_i+x_i%5Cright%29+%3D+%5Cmathrm%7Bsign%7D%5Cleft%28+%5Cmathbf%7Bw%7D%5E%7B%5Cmathbf%7BT%7D%7D%5Cmathbf%7Bx%7D%5Cright%29&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;displaystyle h(&#92;mathbf{x}) = &#92;mathrm{sign}&#92;left(&#92;sum_{i=0}^d w_i x_i&#92;right) = &#92;mathrm{sign}&#92;left( &#92;mathbf{w}^{&#92;mathbf{T}}&#92;mathbf{x}&#92;right)" title="&#92;displaystyle h(&#92;mathbf{x}) = &#92;mathrm{sign}&#92;left(&#92;sum_{i=0}^d w_i x_i&#92;right) = &#92;mathrm{sign}&#92;left( &#92;mathbf{w}^{&#92;mathbf{T}}&#92;mathbf{x}&#92;right)" class="latex" /></p>
<p>Note that $\mathbf{w}^T\mathbf{x}$ is notation for the <a href="http://en.wikipedia.org/wiki/Scalar_product">scalar product</a> between the vectors $\mathbf{w}$ and $\mathbf{x}$. Thus the problem of classifying is that of finding the vector of weights $\mathbf{w}$, given a training data set of <em>N</em> vectors $\mathbf{x}_n$ together with their corresponding class labels $(y_1, \ldots, y_N)$.</p>
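<p>As a minimal sketch of the separator formula above, the following Python snippet evaluates the sign of the scalar product after adding the artificial coordinate $x_0 = 1$. The function names <code>augment</code> and <code>predict</code>, the example weights and the convention of mapping a score of exactly zero to -1 are assumptions made for illustration; they are not part of the original formulation.</p>

```python
import numpy as np

def augment(X):
    """Prepend the artificial coordinate x_0 = 1 to every row of X."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    """Return h(x) = sign(w^T x) for each row of X.

    Assumption: a score of exactly 0 is assigned to class -1.
    """
    scores = augment(X) @ w
    return np.where(scores > 0, 1, -1)

# Hypothetical weights (w_0, w_1, w_2): the separator is the line x_1 + x_2 = 1.
w = np.array([-1.0, 1.0, 1.0])
X = np.array([[2.0, 2.0],   # lies above the line
              [0.0, 0.0]])  # lies below the line
labels = predict(w, X)
print(labels)
```

<p>Here <code>w[0]</code> carries the threshold, so no separate comparison against $b$ is needed.</p>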

<h3>The perceptron learning algorithm (PLA)</h3>
<p>The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w}" title="&#92;mathbf{w}" class="latex" /> (without loss of generality one can begin with a vector of zeros). <a href="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png"><img data-attachment-id="555" data-permalink="https://datasciencelab.wordpress.com/2014/01/10/machine-learning-classics-the-perceptron/perceptron_update/" data-orig-file="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830" data-orig-size="289,293" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="perceptron_update" data-image-description="" data-medium-file="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830?w=289" data-large-file="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830?w=289" class="alignright size-full wp-image-555" alt="perceptron_update" src="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830" srcset="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png 289w, https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=148 148w" sizes="(max-width: 289px) 100vw, 289px"   /></a>It then assesses how good a guess that is by comparing the predicted labels with the actual, correct labels (remember that those are available for the training set, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.</p>
<p>That direction is as follows: given a labeled training data set, if <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w}" title="&#92;mathbf{w}" class="latex" /> is the guessed weight vector and <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{x}_n" title="&#92;mathbf{x}_n" class="latex" /> is an incorrectly classified point, i.e. one with $\mathrm{sign}(\mathbf{w}^T\mathbf{x}_n) \neq y_n$, then the weight vector <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w}" title="&#92;mathbf{w}" class="latex" /> is updated to <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D+%2B+y_n+%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w} + y_n &#92;mathbf{x}_n" title="&#92;mathbf{w} + y_n &#92;mathbf{x}_n" class="latex" />. This is illustrated in the plot on the right, taken from <a href="http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-in-python/">this clear article on the perceptron</a>.</p>
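<p>A short check of why this update points the right way, writing $\mathbf{w}'$ for the updated vector (a symbol introduced here only for this argument) and using $y_n^2 = 1$:

$$
y_n \, \mathbf{w}'^T \mathbf{x}_n = y_n (\mathbf{w} + y_n \mathbf{x}_n)^T \mathbf{x}_n = y_n \, \mathbf{w}^T \mathbf{x}_n + y_n^2 \, \|\mathbf{x}_n\|^2 = y_n \, \mathbf{w}^T \mathbf{x}_n + \|\mathbf{x}_n\|^2
$$

A point is correctly classified when $y_n \, \mathbf{w}^T \mathbf{x}_n > 0$, so every update increases this quantity by $\|\mathbf{x}_n\|^2$, which is at least 1 because of the artificial coordinate $x_0 = 1$. A single update is not guaranteed to fix the point, and it may disturb other points, but it always moves the separator towards classifying $\mathbf{x}_n$ correctly.</p>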
<p>A nice feature of the perceptron learning rule is that if there exists a set of weights that solves the problem (i.e. if the data is linearly separable), then the perceptron is guaranteed to find such a set of weights after a finite number of updates.</p>
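<p>The learning rule can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not a reference one: the function name <code>train_perceptron</code>, the random choice of which misclassified point to update on, the <code>max_updates</code> safety cap and the toy data set are all choices made for this example.</p>

```python
import numpy as np

def train_perceptron(X, y, max_updates=10_000, seed=0):
    """Run the perceptron learning algorithm; return (w, number_of_updates).

    Assumptions for this sketch: the misclassified point is picked at
    random, and max_updates guards against non-separable data.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])  # artificial coordinate x_0 = 1
    w = np.zeros(Xa.shape[1])                      # initial guess: vector of zeros
    for n_updates in range(max_updates):
        predictions = np.where(Xa @ w > 0, 1, -1)  # score 0 counts as class -1
        wrong = np.flatnonzero(predictions != y)
        if wrong.size == 0:
            return w, n_updates                    # every point is classified correctly
        n = rng.choice(wrong)                      # pick one misclassified point
        w = w + y[n] * Xa[n]                       # update rule: w <- w + y_n x_n
    raise RuntimeError("no separating weights found within max_updates")

# Hypothetical, linearly separable toy data in the plane.
X = np.array([[2.0, 2.0], [3.0, 1.0], [1.0, 3.0],
              [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, n_updates = train_perceptron(X, y)
print("weights:", w, "updates:", n_updates)
```

<p>On data that is not linearly separable the loop would never terminate on its own, which is why this sketch raises an error once <code>max_updates</code> is reached.</p>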