$$ \LaTeX \text{ command declarations here.}
\newcommand{\N}{\mathcal{N}}
\newcommand{\R}{\mathbb{R}}
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\norm}[1]{\|#1\|_2}
\newcommand{\d}{\mathop{}\!\mathrm{d}}
\newcommand{\qed}{\qquad \mathbf{Q.E.D.}}
\newcommand{\vx}{\mathbf{x}}
\newcommand{\vy}{\mathbf{y}}
\newcommand{\vt}{\mathbf{t}}
\newcommand{\vb}{\mathbf{b}}
\newcommand{\vw}{\mathbf{w}}
$$

In [1]:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import mlab
from matplotlib import gridspec
import pandas as pd
from IPython.display import display

if "bmh" in plt.style.available:
    plt.style.use("bmh")

import scipy as scp

from scipy import linalg

import scipy.stats

# scikit-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# python
import random

# warnings
import warnings
warnings.filterwarnings("ignore")

# rise config
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'theme': 'simple',
              'start_slideshow_at': 'selected',
              'transition':'fade',
              'scroll': False

});

## Perceptron Algorithm

- The Perceptron is the most basic model for training a linear predictor with sequential update steps.

- Given data $\{\vx_n, t_n \}_{n=1}^N$ with $t_n \in \{-1, 1\}$, here is how the Perceptron works:

    - **Initialize**: Set $\vec{w}_1 = \mathbf{0}$;

    - **For:** $n=1,2,\ldots, N$
        - Observe $\vec{x}_n$, predict $y_n = \text{sign}(\vec{w}_n ^{\top} \phi(\vec{x}_n)) $
        - Receive $t_n \in \{-1,1\}$, update:
            $$
            \vec{w}_{n+1} = \begin{cases}
            \vec{w}_n & \text{if } t_n\vec{w}_n ^T \phi(\vec{x}_n) > 0 \quad \text{(correct prediction)}\\
            \vec{w}_n + t_n \phi(\vec{x}_n) & \text{otherwise} \quad \text{(incorrect prediction)}
            \end{cases}
            $$
    - **End**
        
- Note that we can repeat the for-loop over the data several times, until the classification error falls below a chosen threshold.
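
The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `perceptron_train`, the `max_epochs` cap, and the toy data are assumptions made for this example, and the rows of `Phi` are assumed to already hold the feature vectors $\phi(\vx_n)$.

```python
import numpy as np


def perceptron_train(Phi, t, max_epochs=100):
    """Repeat the perceptron for-loop for up to max_epochs passes over the data.

    Phi: (N, D) array whose rows are the feature vectors phi(x_n).
    t:   (N,) array of labels in {-1, +1}.
    Returns the final weights and the total number of updates (mistakes).
    """
    w = np.zeros(Phi.shape[1])  # initialize w_1 = 0
    n_updates = 0
    for _ in range(max_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            # t_n * w^T phi(x_n) <= 0: the prediction is wrong (or on the boundary)
            if t_n * np.dot(w, phi_n) <= 0:
                w = w + t_n * phi_n
                n_updates += 1
                updated = True
        if not updated:
            # a full pass without any update: every point is classified correctly
            break
    return w, n_updates


# Toy data, made up for illustration: separable by a hyperplane through the origin.
Phi = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

w, n_updates = perceptron_train(Phi, t)
print(w)
print(n_updates)
```

On this toy data a single update (on the first point, since $\vec{w}_1 = \mathbf{0}$ gives a score of exactly zero) is enough, and the second pass makes no further changes.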

### Perceptron: Intuition

- Recall the update
    $$
    \vec{w}_{n+1} = \begin{cases}
    \vec{w}_n & \text{if } t_n\vec{w}_n ^T \phi(\vec{x}_n) > 0 \quad \text{(correct prediction)}\\
    \vec{w}_n + t_n \phi(\vec{x}_n) & \text{otherwise} \quad \text{(incorrect prediction)}
    \end{cases}
    $$
    
- The more *positive* $t_n\vec{w}_n ^T \phi(\vec{x}_n)$ is, the more confidently $\vec{w}_n$ classifies $\vx_n$ correctly.

- **Intuition**
    - When $\vw_n$ gives an incorrect prediction for $\vx_n$, i.e. $t_n\vec{w}_n ^T \phi(\vec{x}_n) \leq 0$, the update above gives
        $$
        \begin{align}
        t_n\vec{w}_{n+1} ^T \phi(\vec{x}_n) 
        &= t_n\vec{w}_n ^T \phi(\vec{x}_n) + t_n^2 \phi(\vx_n)^T \phi(\vx_n) \\
        &= t_n\vec{w}_n ^T \phi(\vec{x}_n) + \underbrace{t_n^2 \| \phi(\vx_n) \|^2}_{\text{Non-negative}}
        \end{align}
        $$
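    - The **non-negative** term $t_n^2 \| \phi(\vx_n) \|^2$ (note that $t_n^2 = 1$) makes $t_n\vec{w}_{n+1} ^T \phi(\vec{x}_n)$ more likely to be positive.
    - Therefore, $\vec{w}_{n+1}$ is likely to perform better on $\vx_n$ than $\vec{w}_n$ did, although a single update does not guarantee a correct prediction.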
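
As a quick numerical check of this identity, the sketch below draws a random weight vector and feature vector (both made up for illustration), picks the label so that the current prediction is wrong, and confirms that one update raises $t_n\vec{w}^T \phi(\vec{x}_n)$ by exactly $\|\phi(\vx_n)\|^2$.

```python
import numpy as np

rng = np.random.RandomState(0)
w_n = rng.randn(3)    # current weights (random, for illustration only)
phi_n = rng.randn(3)  # feature vector phi(x_n)

# Choose the label so that w_n misclassifies this point.
t_n = -1.0 if np.dot(w_n, phi_n) > 0 else 1.0

margin_before = t_n * np.dot(w_n, phi_n)   # <= 0 by construction
w_next = w_n + t_n * phi_n                 # perceptron update
margin_after = t_n * np.dot(w_next, phi_n)

# The increase equals t_n^2 * ||phi_n||^2 = ||phi_n||^2.
increase = margin_after - margin_before
print(margin_before <= 0)
print(np.isclose(increase, np.dot(phi_n, phi_n)))
```

Note that `margin_after` can still be negative when `margin_before` is very negative, which is why the text says "more likely" rather than "guaranteed".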

### Perceptron: Essentially a form of Stochastic Gradient Descent

- Define error function:
    $$
    E(\vec{w}) = \sum_{n=1}^N \max(0, -t_n \vec{w}^T\phi(\vec{x}_n))
    $$

- The (sub)gradient of the $n$-th term of $E(\vec{w})$, i.e. the error on data point $\vx_n$, is
    $$
    \nabla_\vw E(\vec{w} | \vx_n) = 
    \begin{cases}
    0 & \text{if } t_n\vec{w} ^T \phi(\vec{x}_n) > 0 \\
    -t_n \phi(\vec{x}_n) & \text{otherwise}
    \end{cases}
    $$

- The Perceptron is equivalent to the following stochastic gradient descent procedure:
    - **For**: $n=1,2,\ldots, N$
        - $\vec{w}_\text{new} = \vec{w}_\text{old}-\eta \nabla_\vw E(\vec{w}_{\text{old}} | \vx_n)$
    - **End**
        
    
- Notice that the "step size" is fixed at $\eta = 1$! This is atypical.
- The Perceptron was originally viewed as a building block of the *neural network* (NN). Indeed, an NN is often called a Multi-Layer Perceptron (MLP).
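
The equivalence can be checked numerically. The sketch below, run on random data made up for illustration, takes one pass of SGD on the error function with $\eta = 1$ and one pass of the perceptron rule, and compares the resulting weights. The helper name `perceptron_loss_grad` is an assumption for this example, and at the kink $t_n\vec{w}^T\phi(\vec{x}_n) = 0$ it uses the subgradient $-t_n\phi(\vec{x}_n)$, matching the "otherwise" case.

```python
import numpy as np


def perceptron_loss_grad(w, phi_n, t_n):
    """(Sub)gradient of max(0, -t_n * w^T phi_n) with respect to w."""
    if t_n * np.dot(w, phi_n) > 0:
        return np.zeros_like(w)
    return -t_n * phi_n


rng = np.random.RandomState(1)
Phi = rng.randn(20, 3)                        # rows play the role of phi(x_n)
t = np.where(rng.rand(20) < 0.5, -1.0, 1.0)   # random labels in {-1, +1}

eta = 1.0
w_sgd = np.zeros(3)
w_perc = np.zeros(3)
for phi_n, t_n in zip(Phi, t):
    # SGD step on the n-th term of E(w)
    w_sgd = w_sgd - eta * perceptron_loss_grad(w_sgd, phi_n, t_n)
    # Perceptron update rule
    if t_n * np.dot(w_perc, phi_n) <= 0:
        w_perc = w_perc + t_n * phi_n

print(np.allclose(w_sgd, w_perc))
```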

### Perceptron: A magical property

- If the problem is *linearly separable*, i.e. a hyperplane separates the positives from the negatives, then the Perceptron *will find a separating* $\vec{w}^*$.

- **Theorem**: 
    * Assume that $\|\phi(\vec{x}_n)\| \leq 1$ for all $n$
    * Assume there exist $\gamma > 0$ and $\vec{w}$ with $\|\vec{w}\|_2 = 1$ such that $ t_n\vec{w} ^T \phi(\vec{x}_n) > \gamma$ for all $(\vec{x}_n,t_n)$.
    * Then the Perceptron algorithm will find some $\vec{w}^*$ which perfectly classifies all examples
    * The number of updates/mistakes in learning is bounded by $\frac{1}{\gamma^2}$
    
- This is a *margin bound*: it depends on the margin $\gamma$, not on the dimension of $\phi(\vec{x})$.

- Proof is in the notes
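
The bound can be illustrated empirically. The sketch below builds a synthetic data set that satisfies both assumptions (the separator `w_true`, the margin `gamma = 0.1`, and the sampling scheme are all assumptions made for this example), runs the perceptron until a full pass produces no mistakes, and compares the mistake count with $\frac{1}{\gamma^2}$.

```python
import numpy as np

rng = np.random.RandomState(0)
gamma = 0.1  # margin assumed for this synthetic example

# Unit-norm separator, used only to generate the data.
w_true = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Sample points inside the unit disk so that ||phi(x_n)|| <= 1 ...
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
X = X[np.linalg.norm(X, axis=1) <= 1.0]
# ... and keep only points farther than gamma from the separating hyperplane.
X = X[np.abs(X.dot(w_true)) > gamma]
t = np.sign(X.dot(w_true))

w = np.zeros(2)
mistakes = 0
for _ in range(1000):  # generous cap on passes; the theorem implies far fewer are needed
    made_mistake = False
    for x_n, t_n in zip(X, t):
        if t_n * np.dot(w, x_n) <= 0:
            w = w + t_n * x_n
            mistakes += 1
            made_mistake = True
    if not made_mistake:
        break

bound = 1.0 / gamma ** 2
print(mistakes)
print(bound)
```

The bound is a worst case, so the observed number of mistakes is typically well below it.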

> Remark

> - **Proof Sketch**
    > - Let $\vec{w}_*$ be the unit-norm perfect classifier with margin $\gamma$, scaled by $\frac{1}{\gamma}$, so that $\norm{\vec{w}_*}^2 = \frac{1}{\gamma^2}$ and $t_n \vec{w}_*^T \phi(\vec{x}_n) \ge 1$ for all $n$. Let $T$ be the total number of rounds, with $\vec{w}_1 = \mathbf{0}$.
    $$
    \begin{align}
    \frac{1}{\gamma^2} = \norm{\vec{w}_*}^2 & \ge \norm{\vec{w}_* - \mathbf 0}^2 - \norm{\vec{w}_* - \vec{w}_{T+1}}^2 \nonumber \\
    & = \sum \nolimits_{n=1}^T \left( \norm{\vec{w}_* - \vec{w}_{n}}^2 - \norm{\vec{w}_* - \vec{w}_{n+1}}^2 \right) \\
    & = \sum \nolimits_{n \, : \, t_n \vec{w}_n^T \phi(\vec{x}_n) \le 0 } \left( \norm{\vec{w}_* - \vec{w}_{n}}^2 - \norm{\vec{w}_* - (\vec{w}_n + t_n \phi(\vec{x}_n))}^2 \right) \\
    & = \sum \nolimits_{n \, : \, t_n \vec{w}_n^T \phi(\vec{x}_n) \le 0} \left[ 2 \left( \underbrace{t_n (\vec{w}_*^T \phi(\vec{x}_n))}_{\ge 1} \underbrace{- t_n (\vec{w}_n^T \phi(\vec{x}_n))}_{\ge 0} \right) \underbrace{- t_n^2 \norm{\phi(\vec{x}_n)}^2}_{\ge -1} \right] \\
    & \ge \sum \nolimits_{n \, : \, t_n \vec{w}_n^T \phi(\vec{x}_n) \le 0} 1 \quad = \quad \#\,\text{mistakes[Perceptron]}
    \end{align}
    $$

> - See the [learning theory lecture notes](http://web.eecs.umich.edu/~jabernet/eecs598course/fall2015/web/notes/lec16_110515.pdf) for full details. (Note that the notation here differs slightly: index $n$, label $t_n$ and feature vector $\phi(\vx)$ correspond, respectively, to index $t$, label $y_n$ and data $\vx$ in this reference.)