# 3. Linear Models

## Logistic Regression

### Binary Classification Problems

Setting:
* $ X $ is a multiset of feature vectors from an inner product space $ \mathbf{X} $, $ \mathbf{X} \subseteq \mathbb{R}^p $
* $ C = \{0, 1\} $ is a set of two classes
* $ D = \{(\mathbf{x}_1, c_1), \dots, (\mathbf{x}_n, c_n)\} \subseteq X \times C $ is a multiset of examples

Learning task:
* Fit $ D $ using a logistic function $ y() $.

Examples for binary classification problems:
* E-Mail is spam or ham?
* Patient infected or healthy?
* Customer creditworthy or not?

### Linear Regression

![img1](img/topic3img1.png)

* Linear Regression: $ y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} $
* Classification: Predict "spam" if $ y(\mathbf{x}) \geq 0 $ else "ham"

Restrict the range of $ y(\mathbf{x}) $ to reflect the two-class classification semantics:

$ -1 \leq y(\mathbf{x}) \leq 1 $ or $ 0 \leq y(\mathbf{x}) \leq 1 $ 

### Sigmoid (Logistic) Function

$ \sigma(z) = \frac{1}{1 + e^{-z}} $

Linear Regression $ \circ $ Sigmoid Function $ \rightarrow $ Logistic Model Function

$ \mathbf{w}^T \mathbf{x} \circ \frac{1}{1 + e^{-z}} \rightarrow y(\mathbf{x}) \equiv \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}} $

$ y: \mathbb{R}^{p + 1} \rightarrow (0, 1) $

This is interpreted as the estimated probability for the event $ \boldsymbol{\mathsf{C}} = 1 $:
* $ y(\mathbf{x}) = P(\boldsymbol{\mathsf{C}}=1 \mid \boldsymbol{\mathsf{X}}=\mathbf{x}; \mathbf{w}) =: p(1 \mid \mathbf{x}; \mathbf{w}) $ "Probability for C=1 given x, parameterized by w"
* $ 1 - y(\mathbf{x}) = P(\boldsymbol{\mathsf{C}}=0 \mid \boldsymbol{\mathsf{X}}=\mathbf{x}; \mathbf{w}) =: p(0 \mid \mathbf{x}; \mathbf{w}) $ "Probability for C=0 given x, parameterized by w"

Example (email spam classification):

\begin{equation*}
\begin{split}
\mathbf{x} = 
\begin{pmatrix}
x_0 \\ 
x_1
\end{pmatrix}
=
\begin{pmatrix}
1 \\ 
|\text{obscene words}|
\end{pmatrix},
\quad
\mathbf{x}_1 = 
\begin{pmatrix}
1 \\
5
\end{pmatrix}
\text{ and }
y(\mathbf{x}_1) = 0.67
\end{split}
\end{equation*}

$ \Rightarrow $ 67% chance that this email is spam.
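
The example can be reproduced with a few lines of NumPy. This is only a sketch: the notes do not state the weight vector, so the values in `w` below are hypothetical and were chosen so that the output lands near 0.67.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def y(x, w):
    """Logistic model function y(x) = sigma(w^T x); x must include x_0 = 1."""
    return sigmoid(w @ x)

# Hypothetical weights (not from the notes), picked so that y(x1) is about 0.67.
w = np.array([-1.8, 0.5])
x1 = np.array([1.0, 5.0])    # x_0 = 1, five obscene words

p_spam = y(x1, w)            # estimated P(C=1 | x1; w)
p_ham = 1.0 - p_spam         # estimated P(C=0 | x1; w)
print(round(float(p_spam), 2))   # prints 0.67
```

With these weights, $ \mathbf{w}^T \mathbf{x}_1 = -1.8 + 0.5 \cdot 5 = 0.7 $ and $ \sigma(0.7) \approx 0.668 $.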


Recap: **Linear Regression for classification**
![img2](img/topic3img2.png)

Recap: **Logistic Regression for classification**
![img3](img/topic3img3.png)

### The BGD Algorithm

Algorithm: Batch Gradient Descent

Input: 
- $ D $ (multiset of examples $ (\mathbf{x}, c) $ with $ \mathbf{x} \in \mathbb{R}^p, c \in \{0, 1\} $)
- $ \eta $ Learning rate, small positive constant

Output:

$ \mathbf{w} $ weight vector from $ \mathbb{R}^{p + 1} $ (= hypothesis)

![img4](img/topic3img4.png)


(Repeat until convergence):

`FOREACH (x, c) in D DO:`
- [Model Function evaluation]
- [Calculation of residual]
- [Calculation of derivative of the loss, accumulate for D]
`ENDDO`
- Parameter Vector update = one gradient step down
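
The loop structure above could be sketched in Python as follows. The zero initialization, the stopping criterion (norm of the update below `tol`), the cap on epochs, and the toy data are assumptions made for illustration; the update $ \eta \cdot (c - y(\mathbf{x})) \cdot \mathbf{x} $ is the negative gradient of the logistic loss for a single example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bgd_logistic(D, eta=0.05, max_epochs=10000, tol=1e-6):
    """Batch gradient descent for logistic regression with logistic loss.

    D is a list of (x, c) pairs with x in R^p and c in {0, 1}.
    Returns w in R^(p+1); w[0] is the weight of the constant x_0 = 1.
    """
    p = len(D[0][0])
    w = np.zeros(p + 1)                       # assumption: start at zero
    for _ in range(max_epochs):               # "repeat until convergence"
        delta_w = np.zeros(p + 1)
        for x, c in D:
            x = np.concatenate(([1.0], x))    # prepend x_0 = 1
            y = sigmoid(w @ x)                # model function evaluation
            residual = c - y                  # calculation of residual
            delta_w += eta * residual * x     # accumulate over all of D
        w += delta_w                          # one gradient step down
        if np.linalg.norm(delta_w) < tol:     # assumed stopping criterion
            break
    return w

# Made-up, non-separable toy data: larger feature values tend to be class 1.
D = [(np.array([0.0]), 0), (np.array([1.0]), 0), (np.array([2.0]), 1),
     (np.array([3.0]), 0), (np.array([4.0]), 1), (np.array([5.0]), 1)]
w = bgd_logistic(D)
p_high = sigmoid(w @ np.array([1.0, 5.0]))    # estimated P(C=1) for x = 5
p_low = sigmoid(w @ np.array([1.0, 0.0]))     # estimated P(C=1) for x = 0
```

Note that the weights change only once per pass over $ D $, after all examples have contributed to `delta_w`.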

![img5](img/topic3img5.png)

More complex polynomials entail more complex decision boundaries (see lecture notes).

...

## Loss Computation in Detail

2nd part of ML stack: "Optimization Objective"
* Objective: Minimize Loss
* Regularization: None
* Loss: 0/1 loss, squared loss, logistic loss, cross-entropy loss, hinge loss

* The pointwise loss $ l(c, y(\mathbf{x})) $ quantifies the error introduced by some $ \mathbf{x} $. The loss depends on the hypothesis $ y() $ and the true class $ c $ of $ \mathbf{x} $.
* For $ y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} $ we define the following pointwise loss functions
  - 0/1 loss : $ l_{0/1}(c, y(\mathbf{x})) = I_{\neq}(c, \text{sign}(y(\mathbf{x}))) $ which is zero if $ c = \text{sign}(y(\mathbf{x})) $ and 1 otherwise
  - Squared loss : $ l_2(c, y(\mathbf{x})) = (c - y(\mathbf{x}))^2 $

![img6](img/topic3img6.png)

* For $ y(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}} $ we define the following pointwise loss functions:
  - 0/1 loss : $ l_{0/1}(c, y(\mathbf{x})) = I_{\neq}(c, \lfloor y(\mathbf{x}) + 0.5 \rfloor) $
  - Logistic loss : $ l_{\sigma}(c, y(\mathbf{x})) = -\log(y(\mathbf{x})) $ if $ c = 1 $, $ -\log(1 - y(\mathbf{x})) $ if $ c = 0 $

![img7](img/topic3img7.png)
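
The pointwise loss functions defined in this section can be written down directly. The sketch below assumes classes $ c \in \{-1, 1\} $ for the linear model (implied by the use of $ \text{sign} $) and $ c \in \{0, 1\} $ for the logistic model; the function names are made up for illustration.

```python
import numpy as np

def loss_01_linear(c, y):
    """0/1 loss for y(x) = w^T x, assuming c in {-1, 1}.

    np.sign(0) is 0, so y = 0 counts as an error for both classes.
    """
    return 0 if c == np.sign(y) else 1

def loss_squared(c, y):
    """Squared loss (c - y)^2."""
    return (c - y) ** 2

def loss_01_logistic(c, y):
    """0/1 loss for y(x) in (0, 1), c in {0, 1}: compares c with floor(y + 0.5)."""
    return 0 if c == np.floor(y + 0.5) else 1

def loss_logistic(c, y):
    """Logistic loss: -log(y) if c = 1, -log(1 - y) if c = 0."""
    return -np.log(y) if c == 1 else -np.log(1.0 - y)

# For y(x) = 0.67: small loss if the true class is 1, larger loss if it is 0.
print(loss_logistic(1, 0.67), loss_logistic(0, 0.67))
print(loss_01_logistic(1, 0.67), loss_01_logistic(0, 0.67))
```

Unlike the 0/1 loss, the logistic loss still penalizes a correct but unconfident prediction, which is what makes it usable with gradient-based optimization.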

## Overfitting and Regularization

### Overfitting
Let $ D $ be a multiset of examples and let $ H $ be a hypothesis space. The hypothesis $ h_2 \in H $ is considered to overfit $ D $ if an $ h_1 \in H $ exists with the following properties:

$ \text{Err}(h_2, D) < \text{Err}(h_1, D) $ and $ \text{Err}^*(h_1) < \text{Err}^*(h_2) $

Where $ \text{Err}^*(h) $ denotes the true misclassification rate of $ h $ and $ \text{Err}(h, D) $ denotes the error of $ h $ on $ D $.

Reasons for overfitting are often rooted in the example set $ D $:
* $ D $ is noisy
* $ D $ is biased
* $ D $ is too small

![img8](img/topic3img8.png)

Let $ D_{test} $ be a set of test samples. If $ D = D_{tr} \cup D_{test} $ is representative of the real-world population in $ X $, then the quadratic model function $ y(x) = w_0 + w_1 \cdot x + w_2 \cdot x^2 $ is the closest match. 

![img9](img/topic3img9.png)

Moreover, let $ D_{tr} $ and $ D_{test} $ be training and test sets of $ D $, and let $ \text{Err}(h, D_{test}) $ be an estimate for $ \text{Err}^*(h) $ (holdout estimation). The hypothesis $ h_2 $ is considered to overfit $ D $ if an $ h_1 \in H $ exists with the following properties:

$ \text{Err}(h_2, D_{tr}) < \text{Err}(h_1, D_{tr}) $ and $ \text{Err}(h_1, D_{test}) < \text{Err}(h_2, D_{test}) $ 

In particular: $ \text{Err}(h_2, D_{test}) \gg \text{Err}(h_2, D_{tr}) $

### Mitigation strategies

How to detect overfitting
* Visual inspection (apply projection or embedding for dimensionalities $ p > 3 $)
* Validation (Given a test set, the difference $ \text{Err}(y(), D_{test}) - \text{Err}(y(), D_{tr}) $ is too large)
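
The validation check can be illustrated with a small experiment. Everything here is an assumption made for the sketch: the data are synthetic and generated from a noisy quadratic, the error measure is the mean squared error instead of a misclassification rate, and the polynomial degrees 1, 2, and 9 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic examples from a noisy quadratic (made up for this sketch)."""
    x = rng.uniform(-3.0, 3.0, n)
    y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0.0, 1.0, n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_tr, y_tr = make_data(15)        # small training set D_tr
x_test, y_test = make_data(200)   # holdout test set D_test

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)    # least-squares fit on D_tr
    err_tr = mse(coeffs, x_tr, y_tr)
    err_test = mse(coeffs, x_test, y_test)
    results[degree] = (err_tr, err_test)
    print(degree, err_tr, err_test, err_test - err_tr)
```

The training error can only shrink as the degree grows, because each polynomial family contains the previous one. A large gap `err_test - err_tr` for the high-degree fit is the symptom the validation check looks for.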

How to tackle overfitting
* Increase quantity and/or quality of the training data $ D $
* Early stopping of the optimization (training) process
* Regularization (increase model bias by constraining the hypothesis space)
  - Model function (consider functions of lower complexity)
  - Hypothesis $ \mathbf{w} $: Bound the absolute values of the weights in $ \mathbf{w} $ of a model function


### Regularization

### Bound the absolute values of the weights $ \mathbf{w} $

Principle: Add to the loss function (term) a regularization function (term), $ R(\mathbf{w}) $:

$ \mathcal{L}(\mathbf{w}) = L(\mathbf{w}) + \lambda \cdot R(\mathbf{w}) $,

where $ \lambda \geq 0 $ controls the impact of $ R(\mathbf{w}) $, and $ R(\mathbf{w}) \geq 0 $.

![img10](img/topic3img10.png)

![img11](img/topic3img11.png)

Observations:
* Model complexity depends (also) on the magnitude of weights $ \mathbf{w} $
* Minimizing $ L(\mathbf{w}) $ sets no bounds on the weights $ \mathbf{w} $
* Regularization is achieved with a "counterweight" $ \lambda \cdot R(\mathbf{w}) $ that grows with the magnitude of the weights $ \mathbf{w} $
* Aside from $ \lambda $ no additional hyperparameter is introduced

### The Vector Norm as Regularization Function

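* Ridge Regression: $ R $ is the squared $ L_2 $ (Euclidean) norm of the weights
* Lasso Regression: $ R $ is the $ L_1 $ norm (sum of absolute values) of the weights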

![img12](img/topic3img12.png)

![img13](img/topic3img13.png)

...

### Regularized Linear Regression

* Given $ \mathbf{x} $, predict a real-valued output under a linear model function:

$ y(\mathbf{x}) = w_0 + \sum\limits_{j = 1}^{p} w_j \cdot x_j $

* Vector notation with $ x_0 = 1 $ and $ \mathbf{w} = (w_0, w_1, \dots, w_p)^T $:

$ y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} $

* Given $ \mathbf{x}_1, \dots, \mathbf{x}_n $, assess goodness of fit of the objective function:

$ \mathcal{L}(\mathbf{w}) = \text{RSS}(\mathbf{w}) + \lambda \cdot R_{||\vec{\mathbf{w}}||_2^2}(\mathbf{w}) $

$ = \sum\limits_{i = 1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \lambda \cdot \vec{\mathbf{w}}^T \vec{\mathbf{w}} $
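
A minimal sketch of this objective in NumPy follows. It assumes that $ \vec{\mathbf{w}} = (w_1, \dots, w_p)^T $, i.e. that the bias weight $ w_0 $ is left out of the penalty; the notes do not define $ \vec{\mathbf{w}} $ explicitly, so treat this reading as an assumption. The function name and the data are made up.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """RSS(w) + lam * ||w_vec||_2^2 with w_vec = (w_1, ..., w_p).

    X has shape (n, p + 1) with a leading column of ones (x_0 = 1).
    Assumption: the bias w_0 is excluded from the penalty term.
    """
    residuals = y - X @ w
    rss = residuals @ residuals          # sum of squared residuals
    w_vec = w[1:]                        # weights without the bias w_0
    return rss + lam * (w_vec @ w_vec)

# Three points on the line y = 1 + 2x, so w = (1, 2) has RSS = 0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = np.array([1.0, 2.0])

print(ridge_objective(w, X, y, lam=0.0))   # 0.0 (perfect fit, no penalty)
print(ridge_objective(w, X, y, lam=0.5))   # 2.0 (penalty 0.5 * 2^2)
```

Even a perfect fit has a nonzero objective once $ \lambda > 0 $, so the minimizer trades some goodness of fit for smaller weights.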

...

## Gradient Descent

In the machine learning stack, gradient descent is part of the "Optimization approach" category.

It is a first-order iterative optimization algorithm for finding a local extremum of a differentiable function $ f $.

In our algorithms, $ f $ is the global loss function $ L $, or some other objective function $ \mathcal{L} $.

* The gradient $ \nabla f $ of a differentiable function of several variables is a vector whose components are the partial derivatives of $ f $.
* The gradient points in the direction of steepest ascent; the negative gradient points in the direction of steepest descent.
* Gradient *ascent* means stepping in the direction of the gradient.
* Likewise, *descent* steps in the opposite direction of the gradient, which moves toward a local minimum of the function.
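
As a small illustration of the descent rule, the sketch below minimizes a simple quadratic whose minimum is known. The function $ f $, the starting point, the step size, and the fixed number of steps are arbitrary choices for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=100):
    """Plain gradient descent with a fixed learning rate eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)      # step against the gradient
    return x

# f(x) = (x_1 - 1)^2 + (x_2 + 2)^2 has its minimum at (1, -2).
target = np.array([1.0, -2.0])

def grad_f(x):
    """Gradient of f: the vector of partial derivatives."""
    return 2.0 * (x - target)

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
print(x_min)
```

Replacing the minus sign in the update with a plus would turn this into gradient ascent, which moves away from the minimum of this function.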


### Linear Regression + Squared Loss

![img14](img/topic3img14.png)

Update of weight vector $ \mathbf{w} $:

$ \mathbf{w} = \mathbf{w} + \Delta \mathbf{w} $,

using the gradient of the loss function $ L_2(\mathbf {w}) $ to get the steepest descent:

$ \Delta \mathbf{w} = -\eta \cdot \nabla L_2(\mathbf{w}) $

...

### The BGD Algorithm

Algorithm: Batch Gradient Descent

Input: 
* $ D $ multiset of examples $ (\mathbf{x}, c) $ with $ \mathbf{x} \in \mathbb{R}^p, c \in \{-1, 1\} $
* $ \eta $ learning rate, small positive constant

Output:
$ \mathbf{w} $ Weight vector from $ \mathbb{R}^{p + 1} $ (hypothesis)

![img15](img/topic3img15.png)

Repeat until convergence:
`FOREACH (x, c) in D DO`
* Model function evaluation
* Calculation of residual
* Calculation of derivative of the loss, accumulate for $ D $
`ENDDO`
* Parameter vector update = one gradient step down

The weight adaptation of the BGD algorithm computes in each iteration the global loss, i.e. the loss of *all* examples in $ D $ ("batch gradient descent").

The (squared) loss with regard to a single example (also called pointwise loss):

$ l_2(c, y(\mathbf{x})) = \frac{1}{2}(c - \mathbf{w}^T \mathbf{x})^2 $

The respective weight adaptation computes canonically as follows:

$ \Delta \mathbf{w} = -\eta \cdot \nabla l_2(c, y(\mathbf{x})) = \eta \cdot (c - \mathbf{w}^T \mathbf{x}) \cdot \mathbf{x} $
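
A possible Python rendering of BGD for linear regression with squared loss is sketched below. Each example contributes $ \eta $ times its residual $ (c - \mathbf{w}^T \mathbf{x}) $ times $ \mathbf{x} $, which is the negative gradient of $ \frac{1}{2}(c - \mathbf{w}^T \mathbf{x})^2 $. The zero initialization, the stopping criterion, and the toy data are assumptions for illustration.

```python
import numpy as np

def bgd_linear(D, eta=0.01, max_epochs=10000, tol=1e-8):
    """Batch gradient descent for y(x) = w^T x with squared loss.

    D is a list of (x, c) pairs with x in R^p and c in {-1, 1}.
    Returns w in R^(p+1); w[0] belongs to the constant x_0 = 1.
    """
    p = len(D[0][0])
    w = np.zeros(p + 1)                       # assumption: start at zero
    for _ in range(max_epochs):               # "repeat until convergence"
        delta_w = np.zeros(p + 1)
        for x, c in D:
            x = np.concatenate(([1.0], x))    # prepend x_0 = 1
            y = w @ x                         # model function evaluation
            residual = c - y                  # calculation of residual
            delta_w += eta * residual * x     # accumulate for all of D
        w += delta_w                          # one gradient step down
        if np.linalg.norm(delta_w) < tol:     # assumed stopping criterion
            break
    return w

# Made-up toy data: negative feature values are class -1, positive are class 1.
D = [(np.array([-2.0]), -1), (np.array([-1.0]), -1),
     (np.array([1.0]), 1), (np.array([2.0]), 1)]
w = bgd_linear(D)
print(w)   # close to the least-squares solution (0, 0.6)
```

For this data the least-squares solution is $ w_0 = 0 $, $ w_1 = 0.6 $, so classification by $ \text{sign}(\mathbf{w}^T \mathbf{x}) $ separates the two classes at $ x = 0 $.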

###  The IGD Algorithm

Algorithm: Incremental Gradient Descent

Input and output are the same as for BGD.

![img16](img/topic3img16.png)

Repeat until convergence:
`FOREACH (x, c) in D DO:`
* Model function evaluation
* Calculation of residual
* Calculation of derivative
* Parameter vector update = one gradient step down
`ENDDO`
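
The difference to BGD is only where the update happens: inside the loop, after every single example. A sketch for linear regression with squared loss follows; the fixed number of epochs in place of a convergence test, the zero initialization, and the toy data are assumptions.

```python
import numpy as np

def igd_linear(D, eta=0.01, epochs=500):
    """Incremental gradient descent for y(x) = w^T x with squared loss.

    D is a list of (x, c) pairs with x in R^p and c in {-1, 1}.
    """
    p = len(D[0][0])
    w = np.zeros(p + 1)                       # assumption: start at zero
    for _ in range(epochs):                   # stands in for "until convergence"
        for x, c in D:
            x = np.concatenate(([1.0], x))    # prepend x_0 = 1
            y = w @ x                         # model function evaluation
            residual = c - y                  # calculation of residual
            w += eta * residual * x           # update right away, per example
    return w

# Same made-up toy data as one might use for BGD.
D = [(np.array([-2.0]), -1), (np.array([-1.0]), -1),
     (np.array([1.0]), 1), (np.array([2.0]), 1)]
w = igd_linear(D)
print(w)
```

With a constant learning rate the weights typically settle near, but not exactly at, the batch solution, because every update is based on one example only.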


### Linear Regression + 0/1 Loss

![img17](img/topic3img17.png)

Since $ L_{0/1}(\mathbf{w}) $ is not a differentiable function, the gradient descent method cannot be applied to determine its minimum.

### Logistic Regression + Logistic Loss + Regularization

![img18](img/topic3img18.png)

...