## 3a. Linear classifiers, perceptron

### 1. Perceptron
<img width=30% src="images/3a-1.png">
- Input and output: real values (floating point)
- This model was originally motivated by biology, with $w_{i}$ being the *synaptic weights*, and $x_{i}$ being the *firing rates*
- Wrap the function $f(x)$ above with **activation function** $\sigma$ to get simpler output:

$$
\sigma(x) = 
\begin{cases}
1 & \text{if } x \geq 0 \\
-1 & \text{otherwise}
\end{cases} \rightarrow g(x) = \sigma\,(\,f(x))
$$

- Represent function $g(x) = \sigma\,(\,f(x))$ using tensor operations:
<img width=40% src="images/3a-2.png">
- Perceptron algorithm for training a classifier:

> 1. Start with $w^{0} = 0$.
> 2. While $\exists\, n_{k}$ s.t. $y_{n_{k}} (w^{k} \cdot x_{n_{k}}) \leq 0$, update $w^{k+1} = w^{k} + y_{n_{k}} x_{n_{k}}$ ($k$ is the iteration index).
---
```python
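import torch

def train_perceptron(x, y, nb_epochs_max):
    # x: (n_samples, n_features) float tensor; y: labels in {-1, +1}
    w = torch.zeros(x.size(1))

    for e in range(nb_epochs_max):
        nb_changes = 0
        for i in range(x.size(0)):
            # Sample i is misclassified (or on the boundary): add y[i] * x[i] to w
            if x[i].dot(w) * y[i] <= 0:
                w = w + y[i] * x[i]
                nb_changes += 1
        # A full pass without any update means every sample is classified correctly
        if nb_changes == 0:
            break

    return w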
```
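
As a rough illustration of how the training loop behaves, here is a minimal sketch of the same update rule in plain Python lists (so it runs without PyTorch). The helper names `dot`, `train_perceptron_lists`, and `predict`, the toy data, and the trick of appending a constant `1.0` feature to act as a bias are all assumptions made for this example, not part of the notes above.

```python
def dot(u, v):
    # Inner product of two equal-length lists
    return sum(a * b for a, b in zip(u, v))

def train_perceptron_lists(xs, ys, nb_epochs_max=100):
    # Same rule as above: start at zero, add y * x for each misclassified sample
    w = [0.0] * len(xs[0])
    for _ in range(nb_epochs_max):
        nb_changes = 0
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                nb_changes += 1
        if nb_changes == 0:
            break
    return w

def predict(w, x):
    # g(x) = sigma(f(x)) with sigma the sign-like activation (+1 if >= 0, else -1)
    return 1 if dot(w, x) >= 0 else -1

# Hypothetical, linearly separable toy data; the last feature is a constant 1.0
xs = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-3.0, -1.0, 1.0]]
ys = [1, 1, -1, -1]

w = train_perceptron_lists(xs, ys)
predictions = [predict(w, x) for x in xs]
print(w, predictions)
```

On this toy set a single update is enough, and the loop stops after the first pass that makes no changes.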

### 2. Linear Discriminant Analysis (LDA) algorithm:
- Sigmoid function:
    $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
- LDA Model:
    $$f(x;\, w,\, b) = \sigma\,(w \cdot x + b)$$
(very similar to the perceptron)
- Consequence:
    $$1 - \sigma(x) = 1 - \frac{1}{1 + e^{-x}} = \sigma(-x)$$
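
The identity $1 - \sigma(x) = \sigma(-x)$ can be checked numerically. The following is a small sketch in plain Python; the function names `sigmoid` and `lda_model` and the sample values are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def lda_model(x, w, b):
    # f(x; w, b) = sigma(w . x + b)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Check 1 - sigma(x) == sigma(-x) on a few arbitrary points
symmetry_holds = all(
    math.isclose(1.0 - sigmoid(x), sigmoid(-x))
    for x in (-3.0, -0.5, 0.0, 1.0, 4.0)
)
print(symmetry_holds)

# With w = 0 and b = 0 the model outputs sigma(0)
print(lda_model([1.0, 2.0], [0.0, 0.0], 0.0))
```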

### 3. Multi-dimensional output:
- We can combine multiple linear predictors into a "layer" that takes several inputs and produces several outputs:
    $$\forall i = 1, \ldots, M, \quad y_{i} = \sigma \left(\sum_{j = 1}^N w_{i, j} x_{j} + b_{i}\right)$$
where $b_{i}$ is the "bias" of the $i$-th unit, and $w_{i, 1}, \ldots, w_{i, N}$ are its weights
- With $M = 2$ and $N = 3$, we have:
<img width=60% src="images/3a-3.png">
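
The layer formula can be sketched directly for $M = 2$ outputs and $N = 3$ inputs. This is a plain-Python illustration only: the name `linear_layer`, the use of the sigmoid for $\sigma$, and the numeric values of `W`, `b`, and `x` are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_layer(W, b, x):
    # y_i = sigma(sum_j W[i][j] * x[j] + b[i]) for each of the M units
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

# M = 2 units (rows of W), N = 3 inputs (columns of W); values are arbitrary
W = [[0.5, -1.0, 2.0],
     [1.0, 1.0, -1.0]]
b = [0.0, 0.5]
x = [1.0, 2.0, 0.5]

y = linear_layer(W, b, x)
print(y)
```

Each row of `W` plays the role of one perceptron-like unit, so the layer is equivalent to a matrix-vector product followed by an element-wise activation.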


### 4. Limitations of Linear predictors
- Lack of capacity; for classification, the population must be **linearly separable**
- The example below cannot be separated by a linear classifier:
<img width=30% src="images/3a-4.png">
$\rightarrow$ Solution: pre-process the data to make the two populations linearly separable (by a plane)
<img width=30% src="images/3a-5.png">
$\rightarrow$ Model: $f(x) = \sigma\,(w\cdot\phi(x) + b)$
- Bias-variance tradeoff:
$$
E((Y - y)^{2})  = (E(Y) - y)^{2} + V(Y) = \text{Bias}^{2} + \text{Variance}
$$
$\rightarrow$ The right class of models reduces the bias more and increases the variance less
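
The decomposition of the expected squared error into squared bias plus variance can be verified on simulated predictions. The sketch below assumes a hypothetical target `y_true` and Gaussian-distributed predictions `samples`; these values are made up for the check, and the expectation and variance are taken over the empirical sample.

```python
import math
import random
import statistics

random.seed(0)

y_true = 2.0  # the fixed value y being predicted (assumed for this example)
# Simulated predictions Y: deliberately off-centre (biased) and noisy
samples = [random.gauss(2.5, 1.0) for _ in range(10_000)]

# Left-hand side: E((Y - y)^2)
mse = statistics.fmean([(s - y_true) ** 2 for s in samples])

# Right-hand side: (E(Y) - y)^2 + V(Y), using the population variance
bias_sq = (statistics.fmean(samples) - y_true) ** 2
variance = statistics.pvariance(samples)

decomposition_holds = math.isclose(mse, bias_sq + variance, rel_tol=1e-9)
print(decomposition_holds)
```

The population variance (`pvariance`) is used rather than the sample variance so that the identity holds exactly for the empirical distribution, up to floating-point rounding.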