# Neural Networks

## Perceptron Learning

### The Perceptron of Rosenblatt (1958)

![img1](img/topic6img1.png)

- The weights $ w_j $ model the reinforcement factor
- The threshold function models a decision rule
- The perceptron is a "feed forward system"

Rewriting the threshold as shown above and treating it as a constant input with a variable weight yields the following:

![img2](img/topic6img2.png)

By extending both the weight vector by $ w_0 = -\theta $ and the feature vectors by the constant $ x_0 = 1 $, the learning algorithm gets a canonical form.
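Written out, the extension absorbs the threshold into the weighted sum. The sketch below assumes the threshold function outputs 1 exactly when the weighted sum reaches $ \theta $; the convention at the boundary is an assumption, not taken from the figure.

$$
\sum_{j=1}^{p} w_j x_j \geq \theta
\;\Longleftrightarrow\;
\sum_{j=1}^{p} w_j x_j - \theta \geq 0
\;\Longleftrightarrow\;
\sum_{j=0}^{p} w_j x_j = \mathbf{w}^T \mathbf{x} \geq 0,
\quad \text{with } w_0 = -\theta,\ x_0 = 1.
$$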

### Binary Classification Problems

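Setting: We have a feature space $ X $, a set of classes $ C = \{0, 1\} $, and a set of examples $ D $ of the form $ (\mathbf{x}, c) $. Fit $ D $ using a perceptron $ y() $.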

**The PT Algorithm**

Algorithm: Perceptron Training

Input: $ D $ with $ \mathbf{x} \in \mathbb{R}^p, c \in \{0, 1\} $ and $ \mu $, a small positive constant

Output: Weight vector $ \mathbf{w} \in \mathbb{R}^{p + 1} $ (hypothesis)

![img3](img/topic6img3.png)

Repeat until convergence:

```
t = t + 1
(x, c) = random_select(D)
Model function evaluation
Calculation of indicator for true/false hyperplane side
Calculation of weight correction
Parameter vector update
```
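The loop above can be sketched in Python as follows. This is a minimal illustration, not the exact algorithm from the figures: it assumes the standard perceptron rule ($ \delta = c - y(\mathbf{x}) $, $ \Delta\mathbf{w} = \mu \cdot \delta \cdot \mathbf{x} $), a threshold function that outputs 1 for $ \mathbf{w}^T \mathbf{x} \geq 0 $, zero-initialized weights, and an iteration cap `t_max`. The names `heaviside`, `predict`, `perceptron_training`, `t_max`, and `seed` are hypothetical.

```python
import random

def heaviside(z):
    """Threshold function; the convention 1 for z >= 0 is an assumption."""
    return 1 if z >= 0 else 0

def predict(w, x):
    """Model function y(x) for an extended feature vector x with x[0] == 1."""
    return heaviside(sum(wj * xj for wj, xj in zip(w, x)))

def perceptron_training(D, mu=0.25, t_max=10_000, seed=0):
    """Sketch of PT. D is a list of (x, c) with x = (x_1, ..., x_p), c in {0, 1}."""
    rng = random.Random(seed)
    # Extend every feature vector by the constant x_0 = 1.
    D_ext = [((1.0, *x), c) for x, c in D]
    w = [0.0] * len(D_ext[0][0])  # assumption: zero initialization
    t = 0
    # Convergence here means: every example lies on the correct hyperplane side.
    while t < t_max and any(predict(w, x) != c for x, c in D_ext):
        t = t + 1
        x, c = rng.choice(D_ext)                  # (x, c) = random_select(D)
        y = predict(w, x)                         # model function evaluation
        delta = c - y                             # 0 if correct side, else +1 or -1
        dw = [mu * delta * xj for xj in x]        # weight correction
        w = [wj + dwj for wj, dwj in zip(w, dw)]  # parameter vector update
    return w, t

# Usage: learn the logical AND, which is linearly separable.
D_and = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, t = perceptron_training(D_and)
print(w, t)
```

The value `mu=0.25` is chosen only so that all weights stay exactly representable as floats in this toy example; any small positive constant plays the same role.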

![img4](img/topic6img4.png)

Definition of an (affine) hyperplane $ N = \{\mathbf{x} \mid \vec{\mathbf{n}}^T \mathbf{x} = d\} $
- $ \vec{\mathbf{n}} $ is a normal vector of the hyperplane $ N $
- if $ ||\vec{\mathbf{n}}|| = 1 $ then $ \vec{\mathbf{n}}^T \mathbf{x} - d $ gives the (signed, geometric) distance of a point $ \mathbf{x} $ to $ N $
- if $ \operatorname{sign}(\vec{\mathbf{n}}^T \mathbf{x}_1 - d) = \operatorname{sign}(\vec{\mathbf{n}}^T \mathbf{x}_2 - d) $, then $ \mathbf{x}_1, \mathbf{x}_2 $ are on the same side of the hyperplane
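The hyperplane properties listed above can be checked numerically. The following is a small sketch; the helper names `signed_distance` and `same_side` are hypothetical, and the normal vector is normalized inside the function so that it need not have unit length.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def signed_distance(n, d, x):
    """Signed distance of x to the hyperplane {x | n^T x = d}."""
    norm = math.sqrt(dot(n, n))  # divide by ||n|| so n need not be a unit vector
    return (dot(n, x) - d) / norm

def same_side(n, d, x1, x2):
    """True if n^T x1 - d and n^T x2 - d have the same sign."""
    sign = lambda z: (z > 0) - (z < 0)
    return sign(dot(n, x1) - d) == sign(dot(n, x2) - d)

n, d = (3.0, 4.0), 5.0
print(signed_distance(n, d, (3.0, 4.0)))     # 4.0
print(signed_distance(n, d, (0.0, 0.0)))     # -1.0
print(same_side(n, d, (3.0, 4.0), (0.0, 0.0)))  # False
```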

![img5](img/topic6img5.png)

### Example

![img6](img/topic6img6.png)

* The examples are presented to the perceptron
* It computes a value that is interpreted as a class label

Encoding:

* The encoding of the examples is based on features such as the number of line crossings, most acute angle, longest line etc.
* The class label $ c $ is encoded as a number: examples from $ A $ are encoded with $ 1 $, $ B $ as $ 0 $.

![img7](img/topic6img7.png)

See lecture notes for a step-by-step visualization of the PT algorithm.

### Perceptron Convergence Theorem

Let $ X_0, X_1 $ be two finite sets with vectors of the form $ \mathbf{x} = (1, x_1, \dots, x_p)^T $, let $ X_1 \cap X_0 = \emptyset $ and let $ \hat{\mathbf{w}} $ define a separating hyperplane w.r.t. $ X_0, X_1 $. Moreover, let $ D $ be a set of examples of the form $ (\mathbf{x}, 0), \mathbf{x} \in X_0 $ and $ (\mathbf{x}, 1), \mathbf{x} \in X_1 $.

Then the following holds:

If the examples in $ D $ are processed with the PT algorithm, the constructed weight vector $ \mathbf{w} $ will converge within a finite number of iterations.

See lecture notes for proof.

Given some $ \mathbf{w} $, the PT algorithm checks whether the examples $ (\mathbf{x}, c) \in D $ are on the correct hyperplane side and, if not, adapts $ \mathbf{w} $ (left). The goal is to find a *separating hyperplane* $ \mathbf{w} $.

![img8](img/topic6img8.png)

If the classes are linearly separable (left), the PT algorithm will converge. If no such hyperplane exists, convergence cannot be guaranteed (right).
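The difference between the two cases can be observed in a small experiment. The sketch below is self-contained and rests on assumptions: the standard perceptron update, a threshold output of 1 for $ \mathbf{w}^T \mathbf{x} \geq 0 $, zero-initialized weights, and an iteration budget `t_max` (all names are hypothetical). It uses the logical AND as a linearly separable set and XOR as a set for which no separating hyperplane exists.

```python
import random

def classify(w, x):
    """Perceptron output for an extended vector x (x[0] == 1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

def run_pt(D, mu=0.25, t_max=10_000, seed=0):
    """Return (converged, t) for examples D = [(x, c), ...] with extended x."""
    rng = random.Random(seed)
    w = [0.0] * len(D[0][0])  # assumption: zero initialization
    for t in range(t_max):
        if all(classify(w, x) == c for x, c in D):
            return True, t  # every example is on the correct side
        x, c = rng.choice(D)
        delta = c - classify(w, x)
        w = [wj + mu * delta * xj for wj, xj in zip(w, x)]
    # Budget exhausted: report whether the final weights happen to separate D.
    return all(classify(w, x) == c for x, c in D), t_max

# Feature vectors are already extended by x_0 = 1.
D_and = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
D_xor = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

print(run_pt(D_and))  # converges after finitely many iterations
print(run_pt(D_xor))  # never separates XOR; stops only at t_max
```

Note that hitting `t_max` does not by itself prove that the data are not linearly separable; it only shows that no separating hyperplane was found within the budget.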

### PT Algorithm vs Regression

Given some $ \mathbf{w} $, regression methods calculate a loss that quantifies the "grade of misclassification" by exploiting both the hyperplane side and the distance, given the examples in $ D $. The goal is to find a min-loss hyperplane $ \mathbf{w} $.

![img9](img/topic6img9.png)
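To make the contrast concrete, the sketch below compares the PT indicator with a distance-sensitive loss for two misclassified examples. The squared residual is used purely as an assumed stand-in for the regression loss, since the text does not name a specific one; the function names are hypothetical.

```python
def weighted_sum(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def pt_indicator(w, x, c):
    """PT view: only the hyperplane side counts (0 = correct, +1 or -1 = wrong)."""
    return c - (1 if weighted_sum(w, x) >= 0 else 0)

def squared_loss(w, x, c):
    """Assumed regression loss: squared residual, which grows with the distance."""
    return (c - weighted_sum(w, x)) ** 2

w = (-1.0, 1.0, 1.0)
near = (1.0, 0.25, 0.25)  # w^T x = -0.5, close to the hyperplane
far = (1.0, -1.0, -1.0)   # w^T x = -3.0, far from the hyperplane

# Both examples have class 1 but lie on the wrong side.
print(pt_indicator(w, near, 1), pt_indicator(w, far, 1))  # 1 1
print(squared_loss(w, near, 1), squared_loss(w, far, 1))  # 2.25 16.0
```

The PT indicator treats both errors identically, whereas the loss penalizes the far example more heavily.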

...