# Lecture 19: Linear Models

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
import utils as utils
from fractions import Fraction

from IPython.display import display, HTML

# Inline plotting
%matplotlib inline

# Make sympy print pretty math expressions
sy.init_printing()

utils.load_custom_styles()

## Nearest Prototype Classifiers

Given a set of $N$ samples, each represented by a vector $\mathbf{x}_i \in \mathbb{R}^D$, and the corresponding labels $l_i$, we can define three classifiers that can be used in order to classify a new (unknown) sample $\mathbf{x}_{*} \in \mathbb{R}^D$:
- Nearest Class Centroid
- Nearest Sub-class Centroid 
- $k$-Nearest Neighbour

Because these methods classify a new sample based on the nearest prototype vector of a class (a class mean, a subclass mean, or an individual training sample), they are also known as **Nearest Prototype Classifiers**.

---
### Nearest Class Centroid (NCC)

Nearest Class Centroid represents each class $c_k$ with the corresponding class mean vector:

<img src="figures/lecture-19/ncc-class-representation.png" width="600" />





Given the class mean vectors $\mu_1, \mu_2, \cdots, \mu_K$, the new sample $\mathbf{x}_{*}$ is assigned to the class $c_k$ whose mean vector is closest to it:

<img src="figures/lecture-19/ncc-classification-of-new-sample.png" width="600" />





The figure below shows the mean vectors $\mu_k$ as red points; the red lines represent the decision boundaries.

<img src="figures/lecture-19/ncc-classifier-figure.png" width="400" />





<div class="sidenote">
The NCC classifier assumes that each class is unimodal and follows a Normal Distribution.
</div>
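
The NCC rule can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and a made-up toy dataset; the helper names `ncc_fit` and `ncc_predict` are hypothetical and not part of the lecture material.

```python
import numpy as np

def ncc_fit(X, labels):
    """Return the sorted class labels and one mean vector per class."""
    classes = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def ncc_predict(X_new, classes, centroids):
    """Assign each row of X_new to the class with the closest centroid."""
    # Pairwise Euclidean distances, shape (n_new, n_classes)
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Toy data: two classes in R^2 (values invented for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

classes, centroids = ncc_fit(X, labels)
pred = ncc_predict(np.array([[0.5, 0.5], [4.0, 4.5]]), classes, centroids)
print(pred)  # prints [0 1]
```

Training only stores $K$ mean vectors, so prediction costs $K$ distance computations per sample regardless of the size of the training set.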

---
### Nearest Sub-Class Centroid (NSC)

NSC assumes that each subclass $m$ of class $c_k$ follows a Normal Density (distribution) and is represented by the mean subclass vector $\mu_{km}$:

<img src="figures/lecture-19/nsc-class-representation.png" width="600" />





where
- $q_i$ denotes the subclass label of vector $\mathbf{x}_i$
- $N_{km}$ denotes the number of samples forming subclass $m$ of class $c_k$

Given the subclass mean vectors, the new sample $\mathbf{x}_{*}$ is assigned to the class $c_k$ that contains the closest subclass mean:

<img src="figures/lecture-19/nsc-classification-new-sample.png" width="600" />





In order to define the subclasses of each class, we usually apply a clustering algorithm (e.g. K-Means) using the samples of the corresponding class. The number of subclasses per class is a parameter of the NSC and needs to be decided beforehand.

The figure below shows an example with two subclasses per class. The red dots represent the subclass mean vectors and the red line represents the decision boundary.

<img src="figures/lecture-19/nsc-2-subclass-figure.png" width="400" />
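
A rough sketch of NSC, assuming Euclidean distance and a very small hand-written k-means (Lloyd's algorithm with farthest-point initialisation) to form the subclasses. The functions `kmeans`, `nsc_fit` and `nsc_predict` and the toy data are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

def kmeans(X, m, n_iter=20):
    """Tiny Lloyd's k-means with farthest-point initialisation (illustrative only)."""
    centroids = [X[0]]
    for _ in range(1, m):
        # Distance of every sample to its closest already-chosen centroid
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        q = dists.argmin(axis=1)          # subclass label q_i of each sample
        for j in range(m):
            if np.any(q == j):            # keep the old centroid if a cluster empties
                centroids[j] = X[q == j].mean(axis=0)
    return centroids

def nsc_fit(X, labels, m=2):
    """Cluster each class separately; return subclass means and the class owning each mean."""
    means, owners = [], []
    for c in np.unique(labels):
        means.append(kmeans(X[labels == c], m))
        owners.extend([c] * m)
    return np.vstack(means), np.array(owners)

def nsc_predict(X_new, means, owners):
    """Assign each row of X_new to the class of the closest subclass mean."""
    dists = np.linalg.norm(X_new[:, None, :] - means[None, :, :], axis=2)
    return owners[dists.argmin(axis=1)]

# Toy data: each class consists of two well-separated groups (invented values)
X = np.array([[-4.0, 0.0], [-5.0, 0.0], [-4.0, 1.0], [4.0, 0.0], [5.0, 0.0], [4.0, -1.0],
              [0.0, 3.0], [0.0, 4.0], [1.0, 3.0], [0.0, -3.0], [0.0, -4.0], [-1.0, -3.0]])
labels = np.array([0] * 6 + [1] * 6)

means, owners = nsc_fit(X, labels, m=2)
X_new = np.array([[-4.5, 0.2], [0.2, 3.5], [4.2, 0.0], [0.0, -3.2]])
pred = nsc_predict(X_new, means, owners)
print(pred)  # prints [0 1 0 1]
```

In this toy dataset both class means lie at the origin, so a plain NCC classifier could not separate the classes, while two subclass means per class can.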







---
### Nearest Neighbour (NN)

In the extreme case where the number of subclasses per class is equal to the number of its samples, the resulting classifier is called the Nearest Neighbour (NN) classifier. That is, the NN classifier assigns $\mathbf{x}_{*}$ to the class of the training sample closest to it.

We can also use multiple ($k$) nearest neighbours for classification. Given $k=3$, we find the three training samples closest to the new sample $\mathbf{x}_{*}$ and assign $\mathbf{x}_{*}$ to the class that the majority of them belong to.

<img src="figures/lecture-19/kNN.png" width="600" />


<div class="sidenote">
    Notice that $k$ is an odd number because kNN classifies by a majority vote among the $k$ nearest neighbours. If $k$ is even, the votes may be tied (with two classes, an odd $k$ rules this out). If there is no majority, the classification is arbitrary.
</div>
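
The voting rule can be sketched as follows, assuming Euclidean distance. The function `knn_predict` and the toy data (including one class-1 outlier placed inside class 0) are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X, labels, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest samples
    votes = Counter(labels[nearest].tolist())
    # On a tie, Counter returns the tied label that was encountered first
    return votes.most_common(1)[0][0]

# Toy data: class 0 near the origin, class 1 far away, plus one class-1 outlier
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0],
              [1.2, 1.0]])
labels = np.array([0, 0, 0, 1, 1, 1, 1])

x_new = np.array([1.0, 1.0])
pred_1nn = knn_predict(x_new, X, labels, k=1)  # decided by the outlier alone
pred_3nn = knn_predict(x_new, X, labels, k=3)  # outlier is outvoted 2 to 1
print(pred_1nn, pred_3nn)  # prints 1 0
```

With $k=1$ the single outlier decides the label; with $k=3$ the two class-0 neighbours outvote it, which is why larger $k$ gives smoother decision boundaries.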

The figure below illustrates what happens to the decision boundary when $k$ is an even number. Play with kNN at http://vision.stanford.edu/teaching/cs231n-demos/knn/


<img src="figures/lecture-19/knn-different-ks.png" width="700" />






---
## The Curse of Dimensionality

When the dimensionality of the feature space in which the data representations live is high (comparable to, or even higher than, the number of samples), applying statistical techniques becomes problematic. Roughly speaking, the number of parameters to estimate exceeds the number of observations, so they cannot be estimated reliably. This problem is usually called the **curse of dimensionality**. For this reason, dimensionality reduction techniques are important: they find an optimal (with respect to an associated criterion) mapping from the original feature space $\mathbb{R}^D$ to a low-dimensional feature space $\mathbb{R}^d$, where $d < D$.
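
One concrete symptom can be checked numerically. The sketch below assumes a single Gaussian class model (mean plus covariance) and random toy data; the values of `N` and `D` are arbitrary. With fewer samples than dimensions, the sample covariance matrix cannot have full rank and therefore has no inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10, 50                         # fewer samples than dimensions
X = rng.normal(size=(N, D))           # toy data, one sample per row

cov = np.cov(X, rowvar=False)         # D x D sample covariance
rank = np.linalg.matrix_rank(cov)

# Free parameters of one Gaussian: D for the mean, D(D+1)/2 for the covariance
n_params = D + D * (D + 1) // 2

print(cov.shape, rank, n_params)      # the rank is at most N - 1, far below D
```

Since the rank is at most $N - 1$, any method that needs the inverse of such a matrix breaks down unless the dimensionality is reduced first.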

---
## Fisher Discriminant Analysis

<img src="figures/lecture-19/figure-pca-vs-lda.png" width="400" />





<img src="figures/lecture-19/linear-projection-formula.png" width="600" />





Equation (4.5) can be interpreted as follows (illustrated in the figures below):
1. The vector $\mathbf{w}$ defines the direction of the line onto which we project the vectors $\mathbf{x}_i$. Notice how different vectors $\mathbf{w}$ correspond to different lines.
2. Computing the dot product between the vector $\mathbf{w}$ and any sample $\mathbf{x}_i$ yields a value $y_i$, which gives the position of the projected sample along that line. Notice that the magnitude of the vector $\mathbf{w}$ does not affect the line itself; it only rescales the values $y_i$.


<img src="figures/lecture-19/eq-4.5-explain.png" width="300" /> <img src="figures/lecture-19/eq-4.5-figure-2.png" width="300" />
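
A small numerical sketch of the projection, assuming Equation (4.5) is the linear map $y_i = \mathbf{w}^T \mathbf{x}_i$. The sample values and the vector `w` are invented for illustration.

```python
import numpy as np

X = np.array([[2.0, 1.0], [0.0, 3.0], [-1.0, -1.0]])  # one sample per row
w = np.array([3.0, 4.0])

y = X @ w                          # y_i = w^T x_i
print(y)                           # prints [10. 12. -7.]

# With a unit-length w, y_i is the signed distance from the origin along the line
w_unit = w / np.linalg.norm(w)
y_unit = X @ w_unit
points_on_line = np.outer(y_unit, w_unit)   # projected samples back in R^2

# Rescaling w rescales every y_i by the same factor but keeps their order,
# so the line and the arrangement of the samples along it do not change
y_scaled = X @ (2.0 * w)
print(np.argsort(y), np.argsort(y_scaled))  # identical orderings
```

The residuals `X - points_on_line` are orthogonal to `w`, which confirms that each sample is dropped perpendicularly onto the line spanned by `w`.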







In the figures above, we assume that each class is unimodal and follows a Normal Distribution. This is the assumption of Fisher Discriminant Analysis. The goal of Fisher Discriminant Analysis and Linear Discriminant Analysis is to project our samples into a lower-dimensional space (in our case a line) such that the discrimination between the classes is as high as possible. In other words, we want to find a line represented by $\mathbf{w}$ such that:

1. The distance between mean values of each class in the projected space is as large as possible, and
2. The variance within each class in the projected space is as small as possible

The figure below shows an example where both criteria are optimised:

<img src="figures/lecture-19/lda-goal.png" width="500" />





---
### Objectives
We know that if we assume that samples in each class $c_k$ are unimodal and follow a normal distribution, then they are better discriminated when:

1. The distance between mean values of each class in the projected space is as large as possible, and
2. The variance within each class in the projected space is as small as possible

We have to come up with an objective function, expressed in terms of our vector $\mathbf{w}$ and our samples $\mathbf{x}_i$, that rewards a large distance between the projected class means (objective 1) and penalises a large within-class variance (objective 2).


**(1)** The first objective can be expressed as a function of $\mathbf{w}$:

<img src="figures/lecture-19/difference-mean-values.png" width="600" />





where $m_k$ is the projected mean value of class $c_k$ given by:

<img src="figures/lecture-19/mean-value-formula.png" width="600" />





and $\mathbf{S}_b$ is a $D\times D$ matrix called the **between-class scatter matrix**.

<img src="figures/lecture-19/eq.4.15-between-class-scatter.png" width="600" />



**(2)**  The second objective can be expressed as follows.

The formula for computing the variance in the projected space within each class is given as:

<img src="figures/lecture-19/variance-within-projected-class-samples.png" width="600" />





Since we want to minimise the variance of both classes, we just compute the sum of the variances and come up with a function in terms of $\mathbf{w}$ and $\mathbf{x}_i$:

<img src="figures/lecture-19/sum-of-variances.png" width="600" />





where $\mathbf{S}_w$ is a $D\times D$ matrix called the **within-class scatter matrix**:

<img src="figures/lecture-19/within-scatter-matrix-formula.png" width="600" />





The scatter matrix $\mathbf{S}_k$ of class $c_k$ is given as:

<img src="figures/lecture-19/scatter-matrix-formula.png" width="600" />
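
The two objectives can be checked numerically. The sketch below assumes the standard two-class definitions $\mathbf{S}_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, $\mathbf{S}_k = \sum_{\mathbf{x}_i \in c_k} (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^T$ and $\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$, with the projected variance taken as an unnormalised sum of squared deviations; the figures may use a different normalisation. The data and the vector `w` are made up.

```python
import numpy as np

# Toy samples of two classes in R^2 (invented values)
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(Xk):
    """Scatter matrix of one class: sum of outer products of centred samples."""
    centred = Xk - Xk.mean(axis=0)
    return centred.T @ centred

S_b = np.outer(mu1 - mu2, mu1 - mu2)   # between-class scatter (assumed definition)
S_w = scatter(X1) + scatter(X2)        # within-class scatter

# Compute both objectives directly in the projected space
w = np.array([1.0, 0.5])
y1, y2 = X1 @ w, X2 @ w
m1, m2 = y1.mean(), y2.mean()
between = (m1 - m2) ** 2
within = ((y1 - m1) ** 2).sum() + ((y2 - m2) ** 2).sum()

# The quadratic forms in w reproduce the projected quantities
print(np.isclose(w @ S_b @ w, between))  # prints True
print(np.isclose(w @ S_w @ w, within))   # prints True
```

Expressing both quantities as quadratic forms in $\mathbf{w}$ is what makes the optimisation tractable: the samples enter only through the two $D \times D$ matrices.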





---
### Fisher Ratio

Now that we have an expression for both of our objectives, we can formulate an optimisation problem. Maximising the following criterion, called Fisher's ratio, yields the optimal projection vector $\mathbf{w}$:

<img src="figures/lecture-19/fishers-ratio.png" width="600" />





---
### Solving Fisher's Ratio

The solution of Eq. 4.16 is obtained by solving the generalized eigenanalysis problem:

<img src="figures/lecture-19/fisher-ratio-solution-1.png" width="600" />





If we assume that $\mathbf{S}_w$ is non-singular, i.e. $\mathbf{S}_w$ has an inverse, then we can rewrite (4.17) as:

<img src="figures/lecture-19/fisher-ratio-solution-2a.png" width="600" />





We can compute the optimal $\mathbf{w}$ as follows:

<img src="figures/lecture-19/fisher-ratio-solution-2b.png" width="600" />





Note that for the two-class case, the rank of $\mathbf{S}_b$ is equal to one because it is calculated as the outer product of a single vector with itself. This restricts the number of possible solutions to one.
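
A numerical sketch of the two-class solution, assuming Fisher's ratio has the form $J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w \mathbf{w}}$ and that the optimal direction is proportional to $\mathbf{S}_w^{-1}(\mu_1 - \mu_2)$ (the standard textbook result; the figures above are not reproduced here). The data and helper names are made up.

```python
import numpy as np

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(Xk):
    centred = Xk - Xk.mean(axis=0)
    return centred.T @ centred

S_b = np.outer(mu1 - mu2, mu1 - mu2)
S_w = scatter(X1) + scatter(X2)

def fisher_ratio(w):
    """Between-class separation divided by within-class spread along w."""
    return (w @ S_b @ w) / (w @ S_w @ w)

# Solve S_w w = (mu1 - mu2) instead of forming the inverse explicitly
w_opt = np.linalg.solve(S_w, mu1 - mu2)
J_opt = fisher_ratio(w_opt)

# No random direction should achieve a larger ratio
rng = np.random.default_rng(0)
J_random = np.array([fisher_ratio(w) for w in rng.normal(size=(1000, 2))])
print(J_opt >= J_random.max())   # prints True

# w_opt satisfies the generalized eigenproblem S_b w = lambda S_w w with lambda = J_opt
print(np.allclose(S_b @ w_opt, J_opt * (S_w @ w_opt)))  # prints True
```

Using `np.linalg.solve` avoids computing $\mathbf{S}_w^{-1}$ explicitly, which is both cheaper and numerically more stable when $\mathbf{S}_w$ is close to singular.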

---
## Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis extends the Fisher Linear Discriminant to problems involving more than two classes. In order to define the optimisation problem for LDA, we need to redefine the two scatter matrices to account for more than two classes.

---
### Redefine Objective 1

The between-class scatter matrix $\mathbf{S}_b$ becomes:

<img src="figures/lecture-19/lda-between-class-scatter-matrix.png" width="600" />





which is a matrix with rank at most $K - 1$, restricting the number of possible solutions to $K - 1$.

---
### Redefine Objective 2

The within-class scatter matrix $\mathbf{S}_w$ becomes:

$$
\mathbf{S}_w = \sum_{k=1}^{K} \mathbf{S}_k
$$

where $\mathbf{S}_k$ is the scatter matrix for class $c_k$ given as:

<img src="figures/lecture-19/scatter-matrix-formula.png" width="600" />





---
### Optimisation Function

We need $K-1$ projection vectors $\mathbf{w}$ to discriminate between $K$ classes. Recall that in Fisher Discriminant Analysis, where $K=2$, we only needed one vector $\mathbf{w}$, i.e. $K-1$. We organise all these vectors as the columns of a $D\times (K-1)$ matrix $\mathbf{W}$.

Finally, we can formulate an optimisation problem with respect to $\mathbf{W}$:

<img src="figures/lecture-19/lda-optimisation-function.png" width="600" />





where $Tr(\cdot)$ denotes the trace operator, which sums the diagonal entries of a matrix.

<div class="sidenote">
We usually add the constraint $\mathbf{W}^T \mathbf{W} = \mathbf{I}$ because we want the $\mathbf{w}$ vectors to be orthonormal.
</div>

---
### Solving LDA

The solution of Eq. 4.22 is given by sequentially applying generalized eigenanalysis:

<img src="figures/lecture-19/fisher-ratio-solution-1.png" width="600" />


and keeping the eigenvectors corresponding to the maximal $K - 1$ eigenvalues in
order to form the columns of $\mathbf{W}$.
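
A sketch of the multi-class procedure in NumPy. It assumes the common definition $\mathbf{S}_b = \sum_k N_k (\mu_k - \mu)(\mu_k - \mu)^T$ with $\mu$ the overall mean (the figure may use a different weighting), a non-singular $\mathbf{S}_w$, and randomly generated toy data. It solves the eigenproblem of $\mathbf{S}_w^{-1}\mathbf{S}_b$ and does not enforce the orthonormality constraint from the sidenote.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, n_per_class = 3, 3, 20
true_means = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 1.0], [0.0, 4.0, -1.0]])
X = np.concatenate([m + rng.normal(size=(n_per_class, D)) for m in true_means])
labels = np.repeat(np.arange(K), n_per_class)

mu = X.mean(axis=0)                    # overall mean
S_b = np.zeros((D, D))
S_w = np.zeros((D, D))
for k in range(K):
    Xk = X[labels == k]
    mu_k = Xk.mean(axis=0)
    S_b += len(Xk) * np.outer(mu_k - mu, mu_k - mu)   # assumed weighting by N_k
    S_w += (Xk - mu_k).T @ (Xk - mu_k)                # S_w = sum of class scatters

print(np.linalg.matrix_rank(S_b))      # at most K - 1

# S_b w = lambda S_w w  is equivalent to  (S_w^{-1} S_b) w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
eigvals, eigvecs = eigvals.real, eigvecs.real         # discard round-off imaginary parts

order = np.argsort(eigvals)[::-1][:K - 1]             # indices of the K-1 largest
W = eigvecs[:, order]                                 # D x (K-1) projection matrix
print(W.shape)                                        # prints (3, 2)
```

The remaining eigenvalues are zero up to round-off, reflecting the rank bound on $\mathbf{S}_b$, so additional eigenvectors would carry no discriminative information.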

---
### Classification

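Once we have obtained $\mathbf{W}$, a new sample $\mathbf{x}_{*} \in \mathbb{R}^D$ is classified by first projecting it into the low-dimensional discriminant space and then applying a classifier (e.g. one of the nearest prototype classifiers above) to the projection: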

$$
\mathbf{y}_{*} = \mathbf{W}^T \mathbf{x}_{*}
$$
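
As a rough end-to-end sketch, the block below projects a new sample with $\mathbf{W}$ and then labels it with a nearest class centroid rule in the projected space. The choice of NCC as the final classifier, the helper `lda_fit`, the weighting of $\mathbf{S}_b$ by class size, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def lda_fit(X, labels):
    """Return a D x (K-1) projection matrix from the top eigenvectors of S_w^{-1} S_b."""
    classes = np.unique(labels)
    D = X.shape[1]
    mu = X.mean(axis=0)
    S_b, S_w = np.zeros((D, D)), np.zeros((D, D))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:len(classes) - 1]
    return eigvecs.real[:, order]

# Toy training data: three classes in R^3 (randomly generated)
rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 1.0], [0.0, 4.0, -1.0]])
X = np.concatenate([m + rng.normal(size=(20, 3)) for m in true_means])
labels = np.repeat(np.arange(3), 20)

W = lda_fit(X, labels)
Y = X @ W                              # each row is y_i = W^T x_i

# Class centroids in the projected space
classes = np.unique(labels)
centroids = np.stack([Y[labels == c].mean(axis=0) for c in classes])

# Project a new sample and pick the closest projected centroid
x_new = np.array([3.8, 0.2, 0.9])      # lies close to the second class mean
y_new = W.T @ x_new
pred = classes[np.linalg.norm(centroids - y_new, axis=1).argmin()]
print(y_new.shape, pred)               # prints (2,) 1
```

Distances are computed in $\mathbb{R}^{K-1}$ rather than $\mathbb{R}^D$, so the classifier works in the space where the classes were made as separable as possible.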