<h3>Theory behind the perceptron</h3>
<p>The perceptron learning algorithm was invented in the late 1950s by <a href="http://en.wikipedia.org/wiki/Frank_Rosenblatt">Frank Rosenblatt</a>. It belongs to the class of linear classifiers: for a data set classified according to binary categories (which we will assume to be labeled +1 and -1), the classifier seeks to divide the two classes by a linear separator. The separator is an <em>(n-1)</em>-dimensional hyperplane in an <em>n</em>-dimensional space; in particular, it is a line in the plane and a plane in 3-dimensional space.</p>
<p>Our data set will be assumed to consist of <em>N</em> observations characterized by <em>d</em> features or attributes,

$$
\mathbf{x}_n = (x_1, \ldots, x_d)
$$

for $n = 1, \ldots, N$. The problem of classifying these data points into two classes can be translated to that of finding a set of weights $w_i$ such that all vectors satisfying</p>
<p style="text-align:center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bi%3D1%7D%5Ed+w_i+x_i+%3C+b&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &lt; b" title="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &lt; b" class="latex" /></p>
<p>are assigned to one of the classes, whereas those satisfying</p>
<p style="text-align:center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Csum_%7Bi%3D1%7D%5Ed+w_i+x_i+%3E+b&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &gt; b" title="&#92;displaystyle &#92;sum_{i=1}^d w_i x_i &gt; b" class="latex" /></p>
<p>are assigned to the other, for a given threshold value $b$. If we define $w_0 = -b$ and introduce an artificial coordinate $x_0 = 1$ in our vectors ${x}_n$, we can write the perceptron separator formula as</p>
<p style="text-align:center;"><img src="https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+h%28%5Cmathbf%7Bx%7D%29+%3D+%5Cmathrm%7Bsign%7D%5Cleft%28%5Csum_%7Bi%3D0%7D%5Ed+w_i+x_i%5Cright%29+%3D+%5Cmathrm%7Bsign%7D%5Cleft%28+%5Cmathbf%7Bw%7D%5E%7B%5Cmathbf%7BT%7D%7D%5Cmathbf%7Bx%7D%5Cright%29&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;displaystyle h(&#92;mathbf{x}) = &#92;mathrm{sign}&#92;left(&#92;sum_{i=0}^d w_i x_i&#92;right) = &#92;mathrm{sign}&#92;left( &#92;mathbf{w}^{&#92;mathbf{T}}&#92;mathbf{x}&#92;right)" title="&#92;displaystyle h(&#92;mathbf{x}) = &#92;mathrm{sign}&#92;left(&#92;sum_{i=0}^d w_i x_i&#92;right) = &#92;mathrm{sign}&#92;left( &#92;mathbf{w}^{&#92;mathbf{T}}&#92;mathbf{x}&#92;right)" class="latex" /></p>
<p>Note that $w^Tx$ is notation for the <a href="http://en.wikipedia.org/wiki/Scalar_product">scalar product</a> between vectors $w$ and $x$. Thus the problem of classifying is that of finding the vector of weights $w$ given a training data set of <em>N</em> vectors $x$ with their corresponding vector of class labels $(y_1 \ldots y_N)$.</p>

<h3>The perceptron learning algorithm (PLA)</h3>
<p>The learning algorithm for the perceptron is online, meaning that instead of considering the entire data set at the same time, it only looks at one example at a time, processes it and goes on to the next one. The algorithm starts with a guess for the vector <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w}" title="&#92;mathbf{w}" class="latex" /> (without loss of generality one can begin with a vector of zeros). <a href="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png"><img data-attachment-id="555" data-permalink="https://datasciencelab.wordpress.com/2014/01/10/machine-learning-classics-the-perceptron/perceptron_update/" data-orig-file="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830" data-orig-size="289,293" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}" data-image-title="perceptron_update" data-image-description="" data-medium-file="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830?w=289" data-large-file="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830?w=289" class="alignright size-full wp-image-555" alt="perceptron_update" src="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=830" srcset="https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png 289w, https://datasciencelab.files.wordpress.com/2014/01/perceptron_update.png?w=148 148w" sizes="(max-width: 289px) 100vw, 289px"   /></a>It then assesses how good that guess is by comparing the predicted labels with the actual, correct 
labels (remember that those are available for the training set, since we are doing supervised learning). As long as there are misclassified points, the algorithm corrects its guess for the weight vector by updating the weights in the correct direction, until all points are correctly classified.</p>
<p>That direction is as follows: given a labeled training data set, if <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w}" title="&#92;mathbf{w}" class="latex" /> is the guessed weight vector and <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{x}_n" title="&#92;mathbf{x}_n" class="latex" /> is an incorrectly classified point with $\mathrm{sign}(\mathbf{w}^{T}\mathbf{x}_n) \neq y_n$, then the weight vector <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w}" title="&#92;mathbf{w}" class="latex" /> is updated to <img src="https://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D+%2B+y_n+%5Cmathbf%7Bx%7D_n&#038;bg=ffffff&#038;fg=000000&#038;s=1" alt="&#92;mathbf{w} + y_n &#92;mathbf{x}_n" title="&#92;mathbf{w} + y_n &#92;mathbf{x}_n" class="latex" />. This is illustrated in the plot on the right, taken from <a href="http://www.mblondel.org/journal/2010/10/31/kernel-perceptron-in-python/">this clear article on the perceptron</a>.</p>
<p>A nice feature of the perceptron learning rule is that if there exists a set of weights that solves the problem (i.e. if the data is linearly separable), then the perceptron will find such a set of weights.</p>
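<p>The update rule described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function names <code>sign</code> and <code>train_perceptron</code>, the <code>max_epochs</code> safety cap and the four toy points are made up for this example, and each point is assumed to already carry the artificial coordinate $x_0 = 1$.</p>

```python
def sign(value):
    # Returns +1, -1, or 0. A value of 0 never equals a +1/-1 label,
    # so a point lying exactly on the separator counts as misclassified.
    return (value > 0) - (value < 0)


def train_perceptron(points, labels, max_epochs=1000):
    """Perceptron learning rule on points that already include x_0 = 1."""
    w = [0.0] * len(points[0])  # start from the zero vector
    for _ in range(max_epochs):  # hypothetical cap, in case the data is not separable
        updated = False
        for x, y in zip(points, labels):
            prediction = sign(sum(w_i * x_i for w_i, x_i in zip(w, x)))
            if prediction != y:
                # Update rule: w <- w + y_n * x_n
                w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
                updated = True
        if not updated:  # a full pass without mistakes: all points are classified
            break
    return w


# Toy, linearly separable data (made up); the first coordinate is x_0 = 1.
points = [(1, 2.0, 1.0), (1, 3.0, 2.5), (1, -1.0, -1.5), (1, -2.0, -0.5)]
labels = [1, 1, -1, -1]

w = train_perceptron(points, labels)
predictions = [sign(sum(w_i * x_i for w_i, x_i in zip(w, x))) for x in points]
print(w, predictions)
```

<p>On this toy set a single update is enough, after which every predicted label matches its true label.</p>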

### What is deep learning (DL)?
**Deep learning is about building a function estimator.** Historically, people have explained deep learning (**DL**) in terms of neural networks. Nevertheless, deep learning has outgrown this explanation.

Let’s build a new android named Pieter. Our first task is to teach Pieter to recognize objects. Can the human visual system be replaced by a big function estimator? 
<div class="imgcap">
<img src="images/fc.jpg" style="border:none;width:90%;">
</div>

> Deep learning has many scary looking equations. We will walk through examples to show how it works. Most of them are pretty simple.

A deep network is composed of layers of nodes. Our network above has one input layer, two hidden layers and one output layer. We compute the output of each node with:

$$
z_j = \sum_{i} W_{ij} x_{i} + b_{j}
$$

where $ x_{i} $ are the outputs from the previous layer, $W_{ij}$ are the weights between node $i$ and $j$, and $b_j$ is the bias. Finally, the node converts $z_j$ to a value between 0 and 1 using a sigmoid function.

$$
f(z_j) = \frac{1}{1 + e^{-z_j}}
$$

For example, we have a grayscale image with just four pixels (0.1, 0.3, 0.2, 0.1). We assume the weights $W$ and the bias $b$ are (0.3, 0.2, 0.4, 0.3) and -0.8 respectively. The output of the node is:

$$
\begin{split}
z_j & =  0.3 \cdot 0.1 + 0.2\cdot 0.3 + 0.4\cdot0.2 + 0.3\cdot0.1  - 0.8 = -0.6 \\
f(z_j) &=  \frac{1}{1 + e^{0.6}} \approx 0.35
\end{split}
$$
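The same arithmetic can be reproduced in a few lines of Python. This is just a sketch to check the numbers above; the helper names `sigmoid` and `node_output` are made up for illustration.

```python
import math


def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def node_output(x, w, b):
    # Weighted sum of the inputs plus the bias, followed by the sigmoid.
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)


pixels = [0.1, 0.3, 0.2, 0.1]    # the four grayscale pixels
weights = [0.3, 0.2, 0.4, 0.3]
bias = -0.8

z, activation = node_output(pixels, weights, bias)
print(round(z, 2), round(activation, 2))  # -0.6 0.35
```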

In this network, we make 3 predictions. $Y_1$ is the probability that the image is a school bus. The other two outputs represent the probability of two other object classes.

$ x_{i} $ is called a **feature** in DL. **A deep network extracts features from the training data to make predictions.** For example, one of the nodes may be trained to detect the color yellow. For a yellow school bus, the activation of that node should be high. The essence of DL is to learn $W$ and $b$ from the training data (the **training dataset**). In this exercise, we supply all the weight and bias values to our android Pieter. But as the term “deep learning” implies, Pieter will eventually learn those parameters by himself using the training dataset. A few pieces of the puzzle are still missing, but the deep network diagram and the equations above are the foundation of deep learning.


#### XOR

Can a DL network model a logical function? For the skeptics, we will build an "exclusive or" (a xor b) function using a simple network with $$W$$ and $$b$$ as shown below:
<div class="imgcap">
<img src="images/xor.jpg" style="border:none;width:40%">
</div>
For each node, we apply the same equations mentioned previously:

$$
z_j =  \sum_{i} W_{ij} x_i + b_{j}
$$

$$
h_j = \sigma(z_j) = \frac{1}{1 + e^{-z_j}}
$$
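To see that such a network really can compute XOR, here is a small sketch. The weights and biases below are an assumption chosen for illustration (one hidden node acting roughly like OR, one like NAND, and an output node acting like AND); they are not necessarily the values shown in the figure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    # weights[j] holds the incoming weights of node j, one per input.
    return [sigmoid(sum(w * x for w, x in zip(w_j, inputs)) + b_j)
            for w_j, b_j in zip(weights, biases)]


# Hypothetical parameters, not taken from the figure.
W_hidden = [[20.0, 20.0],     # hidden node 1: close to (a OR b)
            [-20.0, -20.0]]   # hidden node 2: close to NOT (a AND b)
b_hidden = [-10.0, 30.0]
W_out = [[20.0, 20.0]]        # output node: close to (h1 AND h2)
b_out = [-30.0]


def xor_net(a, b):
    h = layer([a, b], W_hidden, b_hidden)
    return layer(h, W_out, b_out)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b)))
```

The large weights push the sigmoid close to 0 or 1, so rounding the output gives the XOR truth table.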


### Build a Linear Regression Model
**Deep learning acquires its knowledge from training data.** Pieter learns the model parameters $$W$$ by processing training data. As another example, Pieter wants to expand his horizons and start online dating. He wants to know how many dates he will get based on his years of education and monthly income. Let's model it with a simple linear equation:

$$
dates = W_1\times \text{years in school} + W_2 \times \text{monthly income} + b
$$

He surveys 1000 people on their income, education and number of online dates. The numbers of dates (the answers) in the training dataset are called the **true values** or **true labels**. The task for Pieter is to find the parameters $$W$$ and $$b$$ using the training data collected by the survey.

The steps are:
1. Take a first guess at W and b.
2. Use the model to compute the result for each sample in the training dataset.
3. Find the average error between the computed values and the true values.
4. Compute how fast the error increases or drops relative to W and b (the gradient).
5. Re-adjust W and b accordingly to reduce the average error.

We repeat steps 2 to 5 many times until the average error is very small for our samples.
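The five steps above can be sketched as a short NumPy loop. Everything specific here is an assumption made for illustration: the survey is replaced by synthetic data, the "true" parameters, the learning rate and the number of iterations are made up, and the features are standardized so that a single step size works for both of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey (all numbers are made up).
N = 1000
X = np.column_stack([rng.uniform(8, 20, N),    # years in school
                     rng.uniform(1, 10, N)])   # monthly income (thousands)
true_W, true_b = np.array([0.5, 1.5]), 2.0
y = X.dot(true_W) + true_b + rng.normal(0, 0.1, N)   # number of dates

# Standardize the features so both have a similar scale.
X = (X - X.mean(axis=0)) / X.std(axis=0)

W, b = np.zeros(2), 0.0             # step 1: first guess
learning_rate = 0.1                 # hypothetical step size
for step in range(500):
    pred = X.dot(W) + b             # step 2: model output for every sample
    error = pred - y
    loss = np.mean(error ** 2)      # step 3: average (squared) error
    dW = 2 * X.T.dot(error) / N     # step 4: how the error changes with W ...
    db = 2 * np.mean(error)         #         ... and with b
    W -= learning_rate * dW         # step 5: adjust W and b to reduce the error
    b -= learning_rate * db

print(loss)
```

After a few hundred iterations the average squared error settles near the noise level that was added to the synthetic labels.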


### Gradient descent
**Deep learning is about learning how much it costs.** Steps 4 and 5 are called gradient descent in DL. We need a function to measure how good our model is. In DL, it is called the **cost function** or **loss function**. It measures the difference between our model and the real world. The mean square error (MSE) between the true labels and the predicted values is one obvious candidate.

$$
MSE= J(\hat{y}, y, W, b) = \frac{1}{N} \sum_i (\hat{y}_i - y_i)^2
$$

where $$ \hat{y}_i $$ is the model prediction and $$ y_i $$ is the true value for sample $$ i $$. We add all the sample errors and take the average. We can visualize the cost below with the x-axis being $$ W_1 $$, the y-axis being $$ W_2 $$ and the z-axis being the average cost $$J$$. To find our model, we need to find the values of $$ W_1 $$ and $$ W_2 $$ with the lowest cost. In short, which model parameters give the lowest error for our samples? The mechanism to find the lowest cost is similar to dropping a marble at a random point $$ (W_1, W_2) $$ and letting gravity do its work.

<div class="imgcap">
<img src="images/solution2.png" style="border:none;">
</div>
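As a quick numeric check of the MSE formula above, here is a tiny sketch with three made-up predictions and true values. The function name `mean_squared_error` is only illustrative.

```python
def mean_squared_error(predictions, targets):
    # Average of the squared differences between predictions and true values.
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


y_hat = [2.0, 4.0, 6.0]   # model predictions (made up)
y = [1.0, 4.0, 8.0]       # true values (made up)

mse = mean_squared_error(y_hat, y)   # (1 + 0 + 4) / 3
print(mse)
```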

There are different cost functions with different objectives and solutions. Some functions are easier to optimize and some are less sensitive to outliers. Nevertheless, **choosing a cost function is a critical factor in building a DL model.**

> Optimizing $$W$$ means finding the trainable parameters with lowest cost.


### Learning rate

Thinking in three dimensions is hard.

<div class="imgcap">
<img src="images/solution_2d.jpg" style="border:none;">
</div>

Let's simplify it to 2-D first. Consider a point at (L1, L2): we cut through the diagram along the blue and orange lines and plot those curves in a 2-D diagram.

<div class="imgcap">
<img src="images/gd.jpg" style="border:none;">
</div>

The x-axis is $$ W $$ and the y-axis is the cost.

**Train a model with gradients**: To lower the cost, we move $$ L_1 $$ to the right. But by how much? L2 has a smaller gradient than L1, i.e. the cost drops faster along the $$ W_1$$ direction than along $$W_2$$. Like a ball dropped at (L1, L2), we expect it to move faster in the $$ W_1$$ direction. Therefore, the adjustment for each of $$ (W_1, W_2) $$ should be proportional to its partial derivative at that point, i.e.

$$
\Delta W_i \propto \frac{\partial J}{\partial W_i} 
$$

$$
\text{ i.e. } \Delta W_1 \propto \frac{\partial J}{\partial W_1} \text{ and } \Delta W_2 \propto \frac{\partial J}{\partial W_2}
$$

Adding a ratio $$\alpha$$, the adjustment to $$W$$ becomes:

$$
\Delta W_i = \alpha \frac{\partial J}{\partial W_i}
$$

$$
W_i = W_i - \Delta W_i
$$

At L1, the gradient is negative, so subtracting $$\Delta W_1$$ moves $$W_1$$ to the right. The variable $$ \alpha $$ is called the **learning rate**.  **A small learning rate learns slowly**: it changes $$W$$ slowly and takes more iterations to locate the minimum. However, the gradient is less accurate when we use a larger step. In DL, finding the right value for the learning rate is a trial-and-error exercise that depends on the problem you are solving. We usually try values ranging from $$10^{-7}$$ to 1 on a logarithmic scale ($$10^{-7}, 10^{-6}, 10^{-5}, \dots, 1$$). Parameters such as the learning rate are called **hyperparameters**: they are not learned from the training data and need to be tuned.
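To make the effect of the learning rate concrete, here is a minimal sketch that applies the update rule above to a toy one-dimensional cost $$J(w) = (w - 3)^2$$. The cost function, the starting point and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(grad, w, alpha, steps):
    # Repeatedly apply W = W - alpha * dJ/dW.
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


def grad(w):
    # Derivative of the toy cost J(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2 * (w - 3)


results = {}
for alpha in (1e-3, 1e-2, 1e-1):
    results[alpha] = gradient_descent(grad, w=0.0, alpha=alpha, steps=100)
    print(alpha, results[alpha])
```

With the same budget of 100 steps, the smallest learning rate is still far from the minimum at 3, while the largest of the three gets very close to it.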

A large learning rate can cause serious problems. It causes $$w$$ to oscillate with increasing cost:
<div class="imgcap">
<img src="images/learning_rate.jpg" style="border:none;">
</div>

Let's start with w = -6 (L1). If the gradient is huge and the learning rate is large, $$w$$ will swing too far to the right, to a point (L2) that may have an even larger gradient. Eventually, rather than dropping down slowly to a minimum, $$w$$ oscillates upward to a higher cost.
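This behavior is easy to reproduce on a toy problem. The sketch below assumes a simple cost $$J(w) = w^2$$ (not the curve in the figure) and a deliberately oversized learning rate of 1.1, starting from w = -6.

```python
def cost(w):
    return w ** 2


def grad(w):
    return 2 * w


w, alpha = -6.0, 1.1          # alpha is deliberately too large
history = [w]
for _ in range(5):
    # Each step multiplies w by (1 - 2 * alpha) = -1.2:
    # the sign flips and the magnitude grows.
    w = w - alpha * grad(w)
    history.append(w)

print([round(v, 2) for v in history])
print([round(cost(v), 2) for v in history])
```

The values of $$w$$ jump from one side of the minimum to the other, and the cost grows at every step instead of shrinking.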

**A large learning rate overshoots your target.** Here is an illustration of what happens in some real-life problems such as natural language processing (NLP). As we gradually descend a slope, we may land in an area with a steep gradient, which bounces $$W$$ all the way back. With this kind of cost function shape, it is very difficult to find the minimum with a constant learning rate. Advanced methods to address this problem will be discussed later.

<div class="imgcap">
<img src="images/ping.jpg" style="border:none;">
</div>

#### Naive gradient checking
There are many ways to compute a partial derivative. One naive but important method is using the simple partial derivative definition.

$$
\frac{\partial f}{\partial x_i} \approx \frac{f(x+\Delta x_i) - f(x-\Delta x_i)}{2 \Delta x_{i}}
$$

Here is a simple code snippet that estimates the derivative of
$$
x^2 \text{ at } x = 4
$$

```python
def gradient_check(f, x, h=0.00001):
    # Central difference: (f(x+h) - f(x-h)) / (2h)
    grad = (f(x+h) - f(x-h)) / (2*h)
    return grad

f = lambda x: x**2
print(gradient_check(f, 4))   # close to the exact derivative 2x = 8
```
We never use this method in production. However, deriving and implementing partial derivatives by hand is tedious and error prone, so we often use the naive method to verify a partial derivative implementation during development.
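Here is a sketch of how such a check might look for a function of several variables. The toy function, its hand-derived gradient and the helper name `numerical_gradient` are assumptions made for this example.

```python
def numerical_gradient(f, x, h=1e-5):
    # Central difference applied to one coordinate at a time.
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def f(x):
    # Toy function: f(x0, x1) = x0^2 + 3 * x0 * x1
    return x[0] ** 2 + 3 * x[0] * x[1]


def analytic_gradient(x):
    # Hand-derived partial derivatives of f, the code we want to verify.
    return [2 * x[0] + 3 * x[1], 3 * x[0]]


point = [1.0, 2.0]
numeric = numerical_gradient(f, point)
analytic = analytic_gradient(point)
max_diff = max(abs(n - a) for n, a in zip(numeric, analytic))
print(numeric, analytic, max_diff)
```

If the largest difference is tiny, the hand-written gradient is very likely correct; a large difference points to a bug in the derivation or in the code.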


### Backpropagation
**Backpropagate your loss to adjust W.** One way to compute the partial derivatives $$ \frac{\partial J}{\partial W_i} $$ is to start from each node in the leftmost layer and propagate the gradient until it reaches the rightmost layer, then move to the next layer and start the process again. For a deep network, this is very inefficient. To compute the partial gradients efficiently, we perform a forward pass and then compute the gradients in a single backward pass (backpropagation).

#### Forward pass

The method _forward_ computes the output:

$$
out = W_1 X_1 + W_2 X_{2} + b
$$

```python
def forward(x, W, b):
    # x: input sample (N, 2)
    # W: Weight (2,)
    # b: bias float
    # out: (N,)
    out = x.dot(W) + b        # X * W + b: (N, 2) * (2,) -> (N,)
    return out
```

To compute the mean square loss:

$$
J = \frac{1}{N} \sum_i (out_i - y_{i})^2
$$

```python
import numpy as np

def mean_square_loss(out, y):
    # out: prediction (N,)
    # y: true value (N,)
    N = out.shape[0]          # Find the number of samples
    loss = np.sum(np.square(out - y)) / N   # Mean of the squared differences from the true value y
    return loss
```
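As a usage sketch, the two functions can be chained on a tiny made-up batch. The input values, weights and labels below are assumptions chosen so that the result is easy to verify by hand; the functions are repeated so the snippet runs on its own.

```python
import numpy as np


def forward(x, W, b):
    return x.dot(W) + b                        # (N, 2) . (2,) -> (N,)


def mean_square_loss(out, y):
    return np.sum(np.square(out - y)) / out.shape[0]


# Made-up batch of N = 3 samples with 2 features each.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
W = np.array([0.5, -1.0])
b = 2.0
y = np.array([0.0, 0.0, 0.0])

out = forward(X, W, b)            # [0.5, -0.5, -1.5]
loss = mean_square_loss(out, y)   # (0.25 + 0.25 + 2.25) / 3
print(out.shape, out, loss)
```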

#### Backpropagation pass

**Find the derivative by backpropagation**. We backpropagate the gradient from the rightmost layer to the left in one single pass. In code, we often name the backpropagated derivatives as:

$$
\frac{\partial J}{\partial out} \text{ as dout}
$$

$$
\frac{\partial J}{\partial var} \text{ as dvar}
$$

<div class="imgcap">
<img src="images/bp.jpg" style="border:none;width:80%">
</div>

> Keep track of the naming of your input and output as well as its **shape** (dimension). This is one great tip when you program DL. (N,) means a 1-D array with N elements. (N,1) means a 2-D array with N rows each containing 1 element. (N, 3, 4) means a 3-D array.

First, compute the first partial derivative $$ \frac{\partial J}{\partial out_i} $$ in the right most layer.

$$
J = \frac{1}{N} \sum_i (out_i - y_i)^2
$$

$$
J_i = \frac{1}{N} (out_i - y_i)^2
$$

$$
\frac{\partial J}{\partial out_i} = \frac{2}{N} (out_i - y_i)
$$

We add a line of code in the mean square loss to compute $$ \frac{\partial J}{\partial out_{i}} $$

```python
def mean_square_loss(h, y):
    # h: prediction (N,)
    # y: true value (N,)
    ...
    dout = 2 * (h-y) / N      # Compute the partial derivative of J relative to out
    return loss, dout
```
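For reference, here is a self-contained sketch of the loss function with the extra line, together with a numerical check of `dout` using the naive central-difference method. The sample values are made up, and the check loop is an addition for illustration only.

```python
import numpy as np


def mean_square_loss(h, y):
    # h: prediction (N,)
    # y: true value (N,)
    N = h.shape[0]
    loss = np.sum(np.square(h - y)) / N
    dout = 2 * (h - y) / N        # dJ/dout_i = 2/N * (out_i - y_i)
    return loss, dout


h = np.array([0.5, -0.5, -1.5])   # made-up predictions
y = np.array([1.0, 0.0, -1.0])    # made-up true values
loss, dout = mean_square_loss(h, y)

# Numerical check: nudge one prediction at a time and watch the loss change.
eps = 1e-6
numeric = np.zeros_like(h)
for i in range(h.shape[0]):
    h_plus, h_minus = h.copy(), h.copy()
    h_plus[i] += eps
    h_minus[i] -= eps
    numeric[i] = (mean_square_loss(h_plus, y)[0]
                  - mean_square_loss(h_minus, y)[0]) / (2 * eps)

print(loss, dout, numeric)
```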


**Use the chain rule to backpropagate the gradient.** Now we have $$ \frac{\partial J}{\partial out_i} $$. We apply the chain rule from calculus to backpropagate the gradient one more layer to the left.

$$
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial out} \frac{\partial out}{\partial W}  
$$

$$
\frac{\partial J}{\partial b} = \frac{\partial J}{\partial out} \frac{\partial out}{\partial b}  
$$

We follow our naming convention to name $$ \frac{\partial J}{\partial W} \text{ as dW} $$ and $$ \frac{\partial J}{\partial b} \text{ as db} $$.

<div class="imgcap">
<img src="images/bp3.jpg" style="border:none;width:60%">
</div>

<div class="imgcap">
<img src="images/bp2.jpg" style="border:none;width:60%">
</div>

With output as:

$$
out = W X + b
$$

The  partial derivatives are:

$$
\frac{\partial out}{\partial W}  = X
$$

$$
\frac{\partial out}{\partial b}  = 1
$$

Apply the derivative to the chain rule:

$$
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial out} \frac{\partial out}{\partial W}  = \frac{\partial J}{\partial out} X
$$

$$
\frac{\partial J}{\partial b} = \frac{\partial J}{\partial out} \frac{\partial out}{\partial b}  = \frac{\partial J}{\partial out} 
$$


A lot of mathematical notation is involved, but the code for $$ dW, db $$ is pretty simple.
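As a rough sketch of how the two chain-rule formulas above might translate into NumPy, consider the snippet below. The function name `backward`, the sample data, the initial parameters and the learning rate are assumptions for illustration. Because $$J$$ sums over all samples, the per-sample contributions to `dW` and `db` are added up.

```python
import numpy as np


def forward(x, W, b):
    return x.dot(W) + b                       # (N, 2) . (2,) -> (N,)


def mean_square_loss(out, y):
    N = out.shape[0]
    loss = np.sum(np.square(out - y)) / N
    dout = 2 * (out - y) / N                  # dJ/dout, shape (N,)
    return loss, dout


def backward(x, dout):
    # dJ/dW = dJ/dout * X, summed over the samples: (2, N) . (N,) -> (2,)
    dW = x.T.dot(dout)
    # dJ/db = dJ/dout * 1, summed over the samples
    db = np.sum(dout)
    return dW, db


# Made-up data and starting parameters.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
W, b = np.array([0.1, -0.2]), 0.0

loss_before, dout = mean_square_loss(forward(X, W, b), y)
dW, db = backward(X, dout)

learning_rate = 0.01                          # hypothetical value
W, b = W - learning_rate * dW, b - learning_rate * db
loss_after, _ = mean_square_loss(forward(X, W, b), y)
print(dW, db, loss_before, loss_after)
```

A single gradient step with a small learning rate already lowers the loss on this toy batch.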