# Lecture 19: Linear Models

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
import utils as utils
from fractions import Fraction

from IPython.display import display, HTML

# Inline plotting
%matplotlib inline

# Make sympy print pretty math expressions
sy.init_printing()

utils.load_custom_styles()

## Nearest Prototype Classifiers

Given a set of $N$ samples, each represented by a vector $\mathbf{x}_i \in \mathbb{R}^D$, and the corresponding labels $l_i$, we can define three classifiers that can be used in order to classify a new (unknown) sample $\mathbf{x}_{*} \in \mathbb{R}^D$:
- Nearest Class Centroid
- Nearest Sub-class Centroid 
- $k$-Nearest Neighbour

Because these methods classify a new sample according to the nearest prototype of a class (a class mean, a sub-class mean, or, in the limit, an individual sample), they are also known as **Nearest Prototype Classifiers**.

---
## Nearest Class Centroid (NCC)

Nearest Class Centroid represents each class $c_k$ with the corresponding mean class vector:

<img src="figures/lecture-19/ncc-class-representation.png" width="600" />





Given the class mean vectors $\mu_1, \mu_2, \cdots, \mu_K$, the new sample $\mathbf{x}_{*}$ is classified to the class $c_k$ corresponding to the
smallest distance:

<img src="figures/lecture-19/ncc-classification-of-new-sample.png" width="600" />
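The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the figures define $\mu_k$ as the sample mean of class $c_k$ and use the squared Euclidean distance; the function names and the toy data are hypothetical.

```python
import numpy as np

def fit_class_centroids(X, labels):
    """Return the sorted class labels and one mean vector per class."""
    classes = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_ncc(X_new, classes, centroids):
    """Assign each row of X_new to the class with the closest centroid."""
    # Pairwise squared Euclidean distances, shape (n_new, K)
    diffs = X_new[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return classes[np.argmin(sq_dists, axis=1)]

# Toy data (hypothetical): two classes in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

classes, centroids = fit_class_centroids(X, labels)
x_star = np.array([[1.0, 1.0], [4.0, 4.0]])
pred = predict_ncc(x_star, classes, centroids)
print(pred)
```

The first query point lies closer to the centroid of class 0 and the second closer to the centroid of class 1.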





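Let us consider Bayes' formula for a binary classification problem. We classify $\mathbf{x}_{*}$ as $c_1$ if
\begin{align}
P(c_1 \mid \mathbf{x}_{*}) &> P(c_2 \mid \mathbf{x}_{*}) \\
p(\mathbf{x}_{*} \mid c_1) P(c_1) &> p(\mathbf{x}_{*} \mid c_2) P(c_2)
\end{align}
where the second line is the first one rewritten using Bayes' formula. The next section walks through this step.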

## Probability-based Classifier vs Nearest Centroid Classifier

There is a connection between the probability-based classifier and the nearest centroid classifier.

---

First, let us recall how the probability-based classifier works. In the case of binary classification, we can classify a sample $\mathbf{x}$ using Bayes' decision rule:

**Classify $\mathbf{x}$ as $c_1$ if $P(c_1 \mid \mathbf{x}) > P(c_2 \mid \mathbf{x})$; otherwise classify $\mathbf{x}$ as $c_2$**

Bayes' formula is given as:
$$
P(c_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c_k)P(c_k)}{p(\mathbf{x})}
$$

Let us express the inequality in a different way:

\begin{align}
\frac{p(\mathbf{x} \mid c_1)P(c_1)}{p(\mathbf{x})} > 
\frac{p(\mathbf{x} \mid c_2)P(c_2)}{p(\mathbf{x})}
\end{align}


Since $p(\mathbf{x})$ is positive and appears on both sides, we can multiply both sides of the inequality by it: the denominators cancel and the direction of the inequality is unchanged.

Using Bayes' formula, we can therefore rewrite the decision rule as:

**Classify $\mathbf{x}$ as $c_1$ if $p(\mathbf{x} \mid c_1)P(c_1) > p(\mathbf{x} \mid c_2)P(c_2)$; otherwise classify $\mathbf{x}$ as $c_2$**

---
Next, let us look at how the probabilities are defined.

If we assume that our samples $\mathbf{x} \in \mathbb{R}^D$ follow a normal distribution centered at $\mu$ with covariance $\Sigma$:

$$
\mathbf{x} \sim N(\mu, \Sigma)
$$
then the multivariate normal density is given as:

<img src="figures/lecture-19/eq-3.26.png" width="600" />



So if our class-conditional probability follows a normal distribution $p(\mathbf{x} \mid c_k) \sim N(\mu_k, \Sigma_k)$, then we can use Equation 3.26 to find an expression for the class-conditional probability:
$$
p(\mathbf{x} \mid c_k) = 
 A \exp 
   \left[
     -\frac{1}{2} (\mathbf{x} - \mu_k)^T \Sigma_k^{-1} (\mathbf{x} - \mu_k)  
    \right]
$$
where $A$ is a normalising constant (it depends on the class only through $\Sigma_k$) given by:
$$
A = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_k \rvert^{1/2}}
$$

---

The expression $(\mathbf{x} - \mu_k)^T \Sigma_k^{-1} (\mathbf{x} - \mu_k)$ is the squared Mahalanobis distance between $\mathbf{x}$ and $\mathbf{\mu}_k$ given the covariance matrix $\Sigma_k$ of class $k$.

The expression for the squared Euclidean distance can be defined as:
$$
\lVert \mathbf{x} - \mu_k \rVert_2^2 = (\mathbf{x} - \mu_k)^T (\mathbf{x} - \mu_k) 
$$

This implies that when $\Sigma_k$ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
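A small numerical check of this statement. The helper name `squared_mahalanobis` and the example values are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def squared_mahalanobis(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T sigma^{-1} (x - mu)."""
    diff = x - mu
    # Solve sigma @ z = diff instead of forming the inverse explicitly
    return float(diff @ np.linalg.solve(sigma, diff))

x = np.array([2.0, 1.0])
mu = np.array([0.0, 0.0])

# With the identity covariance the two distances coincide
d_identity = squared_mahalanobis(x, mu, np.eye(2))
d_euclid = float(np.sum((x - mu) ** 2))

# A non-identity covariance (hypothetical values) rescales the axes
sigma = np.array([[4.0, 0.0], [0.0, 1.0]])
d_scaled = squared_mahalanobis(x, mu, sigma)

print(d_identity, d_euclid, d_scaled)
```

With the diagonal covariance, the first coordinate is divided by its variance of 4 before squaring, so the distance shrinks along that axis.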

---

Let us put everything together. Suppose:

- the covariance matrix of every class is the identity matrix, i.e. $p(\mathbf{x} \mid c_k) \sim N(\mu_k, \mathbf{I})$
- the prior probability is the same for all classes, i.e. $P(c_k) = 1/K$

Then the probability-based classifier and nearest centroid classifier are the same. So the take-away is this:

<div class="sidenote">
    When we fix the covariance matrix of all classes to the identity matrix and all classes have equal prior probability, then the
    nearest centroid classifier is a special case of the more generic probability-based classifier (using the normal distribution). 
</div>
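
The equivalence can be checked directly from the decision rule. A short derivation for two classes, assuming $\Sigma_k = \mathbf{I}$ and $P(c_k) = 1/K$ as stated above, so that $A = (2\pi)^{-D/2}$ is the same for every class:

\begin{align}
p(\mathbf{x} \mid c_1)P(c_1) &> p(\mathbf{x} \mid c_2)P(c_2) \\
\frac{A}{K} \exp\left[-\frac{1}{2} \lVert \mathbf{x} - \mu_1 \rVert_2^2\right] &> \frac{A}{K} \exp\left[-\frac{1}{2} \lVert \mathbf{x} - \mu_2 \rVert_2^2\right] \\
-\frac{1}{2} \lVert \mathbf{x} - \mu_1 \rVert_2^2 &> -\frac{1}{2} \lVert \mathbf{x} - \mu_2 \rVert_2^2 \\
\lVert \mathbf{x} - \mu_1 \rVert_2^2 &< \lVert \mathbf{x} - \mu_2 \rVert_2^2
\end{align}

The third line follows by dividing both sides by the positive constant $A/K$ and taking the logarithm, which is monotonically increasing. The last line multiplies by $-2$, which flips the inequality. So $\mathbf{x}$ is assigned to the class whose mean vector is closest, which is exactly the nearest centroid rule.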

---

The figure below illustrates the mean vectors $\mu_k$ as red points; the red lines represent the decision boundary.

<img src="figures/lecture-19/ncc-classifier-figure.png" width="400" />











---
### Nearest Sub-Class Centroid (NSC)

NSC represents each class $c_k$ by $m$ prototypes $\mathbf{\mu}_{km}$:

<img src="figures/lecture-19/nsc-class-representation.png" width="600" />





where
- $q_i$ denotes the subclass label of vector $\mathbf{x}_i$
- $N_{km}$ denotes the number of samples forming subclass $m$ of class $c_k$

Given the subclass mean vectors, the new sample $\mathbf{x}_{*}$ is classified to the class $c_k$ of the subclass centroid at the smallest distance:

<img src="figures/lecture-19/nsc-classification-new-sample.png" width="600" />





In order to define the subclasses of each class, we usually apply a clustering algorithm (e.g. K-Means) using the samples of the corresponding class. The number of subclasses per class is a parameter of the NSC and needs to be decided beforehand.

<div class="warning">
Interpreting the NSC classifier as a probability-based classifier is complicated because we do not know the true mean vectors. 
The mean vectors are found by applying k-means. Since k-means is stochastic, applying it multiple times yields different results. In order to interpret the NSC classifier as a probability-based classifier, we need to estimate the mean vectors, because the mean vectors of each class are themselves drawn from a probability distribution.
</div>
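
The NSC procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: a plain Lloyd-style k-means with a fixed seed stands in for whatever clustering algorithm is used in practice, and all function names and the toy data are hypothetical. The toy data is chosen so that both class means coincide, a case where a single centroid per class cannot separate the classes.

```python
import numpy as np

def kmeans(X, m, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns m centroids of the rows of X."""
    rng = np.random.default_rng(seed)
    # Initialise with m distinct samples drawn at random
    centroids = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(n_iter):
        d = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        assign = np.argmin(d, axis=1)
        for j in range(m):
            if np.any(assign == j):  # keep the old centroid if a cluster is empty
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids

def fit_nsc(X, labels, m):
    """Run k-means inside each class; return prototypes and their class labels."""
    protos, proto_labels = [], []
    for c in np.unique(labels):
        protos.append(kmeans(X[labels == c], m))
        proto_labels.extend([c] * m)
    return np.vstack(protos), np.array(proto_labels)

def predict_nsc(X_new, protos, proto_labels):
    """Assign each row of X_new to the class of its nearest prototype."""
    d = np.sum((X_new[:, None, :] - protos[None, :, :]) ** 2, axis=2)
    return proto_labels[np.argmin(d, axis=1)]

# Toy data (hypothetical): each class consists of two well-separated blobs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 0.0], [10.0, 1.0], [11.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
              [5.0, -5.0], [5.0, -4.0], [6.0, -5.0]])
labels = np.array([0] * 6 + [1] * 6)

# The two class means are identical, so one centroid per class is not enough
same_class_mean = np.allclose(X[labels == 0].mean(axis=0),
                              X[labels == 1].mean(axis=0))

protos, proto_labels = fit_nsc(X, labels, m=2)
X_new = np.array([[0.5, 0.5], [10.0, 0.5], [5.0, 5.0], [5.0, -5.0]])
pred = predict_nsc(X_new, protos, proto_labels)
print(same_class_mean, pred)
```

With two prototypes per class, each blob gets its own centroid and the query points are assigned to the class of the blob they fall in.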

The figure below shows an example where $m=2$ (two subclasses per class). The red dots represent the mean vectors and the red line represents the decision boundary. Notice that the decision boundary is nonlinear.

<img src="figures/lecture-19/nsc-2-subclass-figure.png" width="400" />







---
### Nearest Neighbour (NN)

In the extreme case where the number of subclasses per class is equal to the number of its samples, the resulting classifier is called the Nearest Neighbor (NN) classifier. That is, the NN classifier assigns $\mathbf{x}_{*}$ to the class of the training sample closest to it.

The decision function of the NN classifier is non-linear.

We can also use multiple ($k$) nearest neighbors for classification. Given $k=3$, we find the three samples in our dataset that are closest to the new sample $\mathbf{x}_{*}$ and assign $\mathbf{x}_{*}$ to the class held by the majority of them.

<img src="figures/lecture-19/kNN.png" width="600" />


<div class="sidenote">
    Notice that $k$ is an odd number because kNN is based on a voting mechanism: it classifies by majority vote. If $k$ is even, the votes can be tied. If there is no majority, the classification is arbitrary.
</div>
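
The voting mechanism can be sketched as below. This is an illustrative implementation, not a reference one: the function name and toy data are hypothetical, and ties are resolved by whichever class `Counter` encountered first, which is an arbitrary choice.

```python
import numpy as np
from collections import Counter

def predict_knn(x_new, X, labels, k=3):
    """Classify one sample by majority vote among its k nearest training samples."""
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    nearest = np.argsort(sq_dists)[:k]  # indices of the k closest samples
    votes = Counter(labels[nearest].tolist())
    # most_common(1) returns the class with the most votes;
    # in case of a tie, the first class encountered wins (arbitrary)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two classes in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

pred_1nn = predict_knn(np.array([2.0, 2.0]), X, labels, k=1)
pred_3nn = predict_knn(np.array([0.5, 0.5]), X, labels, k=3)
print(pred_1nn, pred_3nn)
```

Setting `k=1` recovers the NN classifier described above.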

The figure below illustrates what happens to the decision boundary when $k$ is an even number. The white empty space between classes marks the regions where kNN cannot determine the classification. Play with kNN at http://vision.stanford.edu/teaching/cs231n-demos/knn/


<img src="figures/lecture-19/knn-different-ks.png" width="700" />






<div class="sidenote">
<strong>What is the best classifier?</strong> 
<p>Well, it depends on the data. However, there is a classical theorem (due to Cover and Hart) about the limit of a very large dataset: as the number of samples goes towards infinity, the error rate of the NN classifier is at most twice the lowest error rate any classifier can achieve (the Bayes error). The idea is that when NN has a very large number of samples, the empty spaces between classes become very small, and because NN can form a complicated, jagged decision function, it can follow the true class boundary closely.</p>

<p>In comparison, other classifiers, including neural networks, attempt to define decision functions that are smooth. We can think of them as attempting to fill the empty white space between the classes that is present in NN.</p>

</div>

---
## The Curse of Dimensionality

When the number of dimensions of the feature space in which the data representations live is high (comparable to, or even higher than, the number of samples), the application of statistical techniques is problematic. This is, roughly, because the number of parameters is higher than the number of observations, so the parameters cannot be estimated reliably. This problem is usually called the curse of dimensionality. For this reason, dimensionality reduction techniques are important, i.e. techniques that find an optimal (with respect to an associated criterion) mapping from the original feature space $\mathbb{R}^D$ to a low-dimensional feature space $\mathbb{R}^d$ where $d < D$.
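
One concrete symptom of this problem is that the sample covariance matrix becomes singular when there are fewer samples than dimensions. The sketch below uses hypothetical sizes ($N = 5$, $D = 10$) and random data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 10  # fewer samples than dimensions (hypothetical sizes)
X = rng.normal(size=(N, D))

# The sample covariance is D x D and symmetric, so it has
# D * (D + 1) / 2 free parameters to estimate from only N samples
cov = np.cov(X, rowvar=False)
n_params = D * (D + 1) // 2

# After subtracting the mean, N samples span at most N - 1 directions,
# so the covariance cannot have full rank and has no inverse
rank = np.linalg.matrix_rank(cov)
print(n_params, rank)
```

A singular covariance (or scatter) matrix matters for any method that needs to invert it, such as the Mahalanobis distance.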

---
## Fisher Discriminant Analysis

<img src="figures/lecture-19/figure-pca-vs-lda.png" width="400" />





<img src="figures/lecture-19/linear-projection-formula.png" width="600" />





Equation (4.5) can be thought of as follows:
- The vector $\mathbf{w}$ spans a line, the one-dimensional space onto which we want to project (map) our samples.
- The value $y_i$ is the scaling factor of $\mathbf{w}$ that locates the projected sample on that line. Therefore, the magnitude of the vector $\mathbf{w}$ does not affect the line itself.
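
The two points above can be checked numerically. The sketch assumes Equation (4.5) is the usual linear projection $y_i = \mathbf{w}^T \mathbf{x}_i$ (the equation itself is only shown as a figure); function names and data are hypothetical.

```python
import numpy as np

def project_scalar(X, w):
    """Scalar coordinate y_i = w^T x_i for every row of X."""
    return X @ w

def project_point_on_line(X, w):
    """Orthogonal projection of every row of X onto the line spanned by w."""
    return np.outer(X @ w, w) / (w @ w)

X = np.array([[1.0, 2.0], [3.0, 0.0], [-1.0, 1.0]])
w = np.array([1.0, 1.0])

y = project_scalar(X, w)
p1 = project_point_on_line(X, w)
p2 = project_point_on_line(X, 5.0 * w)  # rescaled w spans the same line

print(y)
print(np.allclose(p1, p2))
```

Rescaling $\mathbf{w}$ changes the scalar values $y_i$ by the same factor, but the points on the line stay where they are.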

The following figures illustrate how different vectors $\mathbf{w}$ span different lines (black lines).

<img src="figures/lecture-19/figure1.png" width="300" /> <img src="figures/lecture-19/figure2.png" width="300" />







In the figures above, we assume that each class is unimodal and follows a normal distribution. This is the assumption of Fisher Discriminant Analysis. The goal of Fisher Discriminant Analysis and Linear Discriminant Analysis is to project our samples into a lower-dimensional space (in our case a line) such that the discrimination between the classes is as high as possible. In other words, we want to find a line represented by $\mathbf{w}$ such that:

1. The distance between mean values of each class in the projected space is as large as possible, and
2. The variance within each class in the projected space is as small as possible

The figure below shows an example where both criteria are optimised:

<img src="figures/lecture-19/lda-goal.png" width="500" />





---
### Objectives
We know that if we assume that samples in each class $c_k$ are unimodal and follow a normal distribution, then they are better discriminated when:

1. The distance between mean values of each class in the projected space is as large as possible, and
2. The variance within each class in the projected space is as small as possible

We have to come up with an optimisation function, expressed in terms of our vector $\mathbf{w}$ and our samples $\mathbf{x}_i$, that maximises objective (1) while minimising objective (2).


**(1)** The first objective can be expressed as a function of $\mathbf{w}$:

<img src="figures/lecture-19/difference-mean-values.png" width="600" />





where $m_k$ is the projected mean value of class $c_k$ given by:

<img src="figures/lecture-19/mean-value-formula.png" width="600" />





and $\mathbf{S}_b$ is a $D\times D$ matrix called the **between-class scatter matrix**.

<img src="figures/lecture-19/eq.4.15-between-class-scatter.png" width="600" />



**(2)**  The second objective can be expressed as follows.

The formula for computing the variance in the projected space within each class is given as:

<img src="figures/lecture-19/variance-within-projected-class-samples.png" width="600" />





Since we want to minimise the variance of both classes, we just compute the sum of the variances and come up with a function in terms of $\mathbf{w}$ and $\mathbf{x}_i$:

<img src="figures/lecture-19/sum-of-variances.png" width="600" />





where $\mathbf{S}_w$ is a $D\times D$ matrix called the **within-class scatter matrix**:

<img src="figures/lecture-19/within-scatter-matrix-formula.png" width="600" />





The formula for the scatter matrix of a class is given as:

<img src="figures/lecture-19/scatter-matrix-formula.png" width="600" />
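
Both objectives can be verified numerically for the two-class case. The sketch assumes the standard definitions (the lecture gives them only as figures): $\mathbf{S}_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, $\mathbf{S}_k = \sum_i (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^T$, $\mathbf{S}_w = \mathbf{S}_1 + \mathbf{S}_2$, and a projected "variance" taken as the unnormalised sum of squared deviations. The data and the vector `w` are hypothetical.

```python
import numpy as np

def scatter_matrix(Xk):
    """S_k = sum_i (x_i - mu_k)(x_i - mu_k)^T for the samples of one class."""
    centered = Xk - Xk.mean(axis=0)
    return centered.T @ centered

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

S_b = np.outer(mu1 - mu2, mu1 - mu2)           # between-class scatter
S_w = scatter_matrix(X1) + scatter_matrix(X2)  # within-class scatter

w = np.array([1.0, 0.5])
y1, y2 = X1 @ w, X2 @ w  # projected samples of each class

# Objective (1): squared distance between the projected class means
between_direct = (y1.mean() - y2.mean()) ** 2
# Objective (2): summed scatter of the projected samples
within_direct = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)

print(np.isclose(between_direct, w @ S_b @ w))
print(np.isclose(within_direct, w @ S_w @ w))
```

Both quantities computed directly in the projected space agree with the quadratic forms $\mathbf{w}^T \mathbf{S}_b \mathbf{w}$ and $\mathbf{w}^T \mathbf{S}_w \mathbf{w}$.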





---
### Fisher Ratio

Now that we have an expression for both our objectives, we can formulate an optimisation problem. Maximising the following criterion, called Fisher's Ratio, yields the optimal projection vector $\mathbf{w}$:

<img src="figures/lecture-19/fishers-ratio.png" width="600" />





---
### Solving Fisher's Ratio

The solution of Equation (4.16) is given by solving the generalized eigenanalysis problem:

<img src="figures/lecture-19/fisher-ratio-solution-1.png" width="600" />





If we assume that $\mathbf{S}_w$ is non-singular, i.e. $\mathbf{S}_w$ has an inverse, then we can rewrite (4.17) as:

<img src="figures/lecture-19/fisher-ratio-solution-2a.png" width="600" />





If we perform eigendecomposition on $\mathbf{S} = \mathbf{S}_{w}^{-1}\mathbf{S}_{b}$, then the eigenvector corresponding to the maximal eigenvalue is the optimal $\mathbf{w}$.

We can also compute the optimal $\mathbf{w}$ directly using the following formula:

<img src="figures/lecture-19/fisher-ratio-solution-2b.png" width="600" />





The only difference is that $\mathbf{w}$ vector computed by Equation (4.19) may not be a unit vector.

Note that for the two-class case, the rank of $\mathbf{S}_b$ is equal to one because it is calculated as an outer product of one vector. This restricts the number of possible solutions to one.
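
The two routes to $\mathbf{w}$ can be compared numerically. The sketch assumes that Equation (4.19) is the usual closed form $\mathbf{w} \propto \mathbf{S}_w^{-1}(\mu_1 - \mu_2)$ (it appears only as a figure) and uses hypothetical toy data for which $\mathbf{S}_w$ is non-singular.

```python
import numpy as np

def scatter_matrix(Xk):
    """S_k = sum_i (x_i - mu_k)(x_i - mu_k)^T for the samples of one class."""
    centered = Xk - Xk.mean(axis=0)
    return centered.T @ centered

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

S_b = np.outer(mu1 - mu2, mu1 - mu2)
S_w = scatter_matrix(X1) + scatter_matrix(X2)

# Route 1: eigenvector of S_w^{-1} S_b with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
w_eig = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# Route 2: closed form (assumed), generally not a unit vector
w_direct = np.linalg.solve(S_w, mu1 - mu2)

# Compare directions only: sign and length of w do not matter
cosine = abs(w_eig @ w_direct) / (np.linalg.norm(w_eig) * np.linalg.norm(w_direct))
rank_S_b = np.linalg.matrix_rank(S_b)
print(cosine, rank_S_b)
```

The absolute cosine of the angle between the two vectors is 1 up to rounding, so they span the same line and differ only in scale and possibly sign.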

---
## Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis extends the Fisher Linear Discriminant to problems involving more than two classes. In order to define the optimisation problem for LDA, we need to redefine the two scatter matrices to account for more than two classes.

---
### Redefine Objective 1

The between-class scatter matrix $\mathbf{S}_b$ becomes:

<img src="figures/lecture-19/lda-between-class-scatter-matrix.png" width="600" />





which is a matrix with rank at most $K - 1$, restricting the number of possible solutions to $K - 1$.

---
### Redefine Objective 2

The within-class scatter matrix $\mathbf{S}_w$ becomes:

$$
\mathbf{S}_w = \sum_{k=1}^{K} \mathbf{S}_k
$$

where $\mathbf{S}_k$ is the scatter matrix for class $c_k$ given as:

<img src="figures/lecture-19/scatter-matrix-formula.png" width="600" />





---
### Optimisation Function

We need $K-1$ projection vectors $\mathbf{w}$ to discriminate between $K$ classes. Recall that in Fisher Discriminant Analysis, where $K=2$, we only needed one vector $\mathbf{w}$, i.e. $K-1 = 1$. We organise these vectors as the columns of a $D\times (K-1)$ matrix $\mathbf{W}$. 

Finally, we can formulate an optimisation problem with respect to $\mathbf{W}$:

<img src="figures/lecture-19/lda-optimisation-function.png" width="600" />





where $\mathrm{Tr}(\cdot)$ denotes the trace operator, which sums the diagonal entries of a matrix.

<div class="sidenote">
We usually add the constraint $\mathbf{W}^T \mathbf{W} = \mathbf{I}$ because we want the $\mathbf{w}$ vectors to be orthonormal, so that the projected dimensions do not carry redundant information.
</div>

---
#### Alternative Formulation

Alternatively, we can minimise the following optimisation function:

<img src="figures/lecture-19/lda-alternative-optimisation-criterion.png" width="600" />


where $\mathbf{S}_T$ is called the total scatter matrix given by:

<img src="figures/lecture-19/total-scatter-matrix-formula.png" width="600" />


We can show that maximising the LDA criterion in Equation (4.22) is equivalent to minimising Equation (4.78) as follows:

<img src="figures/lecture-19/lda-equivalent-formulations.png" width="600" />
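
The equivalence rests on the decomposition $\mathbf{S}_T = \mathbf{S}_w + \mathbf{S}_b$, which can be checked numerically. The sketch assumes the usual definitions, $\mathbf{S}_T = \sum_i (\mathbf{x}_i - \mu)(\mathbf{x}_i - \mu)^T$ and $\mathbf{S}_b = \sum_k N_k (\mu_k - \mu)(\mu_k - \mu)^T$ with $\mu$ the overall mean. The figures may use a different scaling, and the random data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N_k, D = 3, 20, 4  # hypothetical sizes
X = np.vstack([rng.normal(loc=3.0 * k, size=(N_k, D)) for k in range(K)])
labels = np.repeat(np.arange(K), N_k)

mu = X.mean(axis=0)
S_T = (X - mu).T @ (X - mu)  # total scatter around the overall mean

S_w = np.zeros((D, D))
S_b = np.zeros((D, D))
for k in range(K):
    Xk = X[labels == k]
    mu_k = Xk.mean(axis=0)
    S_w += (Xk - mu_k).T @ (Xk - mu_k)               # scatter inside class k
    S_b += len(Xk) * np.outer(mu_k - mu, mu_k - mu)  # weighted mean offset

print(np.allclose(S_T, S_w + S_b))
```

Because $\mathbf{S}_T$ splits exactly into the two parts, a criterion written with $\mathbf{S}_T$ can be rewritten in terms of $\mathbf{S}_w$ and $\mathbf{S}_b$.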


---
### Solving LDA

The solution of Eq. 4.22 is given by sequentially applying generalized eigenanalysis:

<img src="figures/lecture-19/fisher-ratio-solution-1.png" width="600" />


and keeping the eigenvectors corresponding to the maximal $K - 1$ eigenvalues in
order to form the columns of $\mathbf{W}$.
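
A minimal sketch of this procedure, assuming $\mathbf{S}_w$ is non-singular and using the weighted between-class scatter $\mathbf{S}_b = \sum_k N_k (\mu_k - \mu)(\mu_k - \mu)^T$ (an assumption about the figure). It solves the ordinary eigenproblem of $\mathbf{S}_w^{-1}\mathbf{S}_b$ with NumPy; the function name `fit_lda` and the random data are hypothetical.

```python
import numpy as np

def fit_lda(X, labels):
    """Return W (D x (K-1)) built from the leading eigenvectors of S_w^{-1} S_b."""
    classes = np.unique(labels)
    D = X.shape[1]
    mu = X.mean(axis=0)
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)
        S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # Assumes S_w is non-singular; solve() avoids forming the inverse explicitly
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    # Keep the eigenvectors of the K - 1 largest eigenvalues
    order = np.argsort(np.real(eigvals))[::-1][: len(classes) - 1]
    return np.real(eigvecs[:, order]), S_b

rng = np.random.default_rng(0)
K, N_k, D = 3, 30, 4  # hypothetical sizes
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [4.0, 0.0, 0.0, 0.0],
                  [0.0, 4.0, 0.0, 0.0]])
X = np.vstack([rng.normal(loc=means[k], size=(N_k, D)) for k in range(K)])
labels = np.repeat(np.arange(K), N_k)

W, S_b = fit_lda(X, labels)
rank_S_b = np.linalg.matrix_rank(S_b)
print(W.shape, rank_S_b)
```

Only $K - 1$ eigenvalues can be non-zero because of the rank of $\mathbf{S}_b$, which is why $\mathbf{W}$ has $K - 1$ columns.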

---
### Classification

Once we have obtained $\mathbf{W}$, we can classify a new sample $\mathbf{x}_{*} \in \mathbb{R}^D$ by first projecting it into the lower-dimensional space:

$$
\mathbf{y}_{*} = \mathbf{W}^T \mathbf{x}_{*}
$$
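
The projection step can be sketched as follows. The lecture does not specify which classifier is applied in the projected space; the nearest class centroid rule used here is one possible choice and is an assumption. The matrix `W` below is a hand-picked hypothetical example (in practice it would come from solving the LDA eigenproblem), as are the function names and data.

```python
import numpy as np

def project(W, X):
    """Map rows of X (in R^D) to rows y = W^T x (in R^(K-1))."""
    return X @ W

def classify_projected(W, X_train, labels, x_star):
    """Nearest class centroid in the projected space (assumed rule)."""
    classes = np.unique(labels)
    Y = project(W, X_train)
    centroids = np.stack([Y[labels == c].mean(axis=0) for c in classes])
    y_star = W.T @ x_star
    nearest = np.argmin(np.sum((centroids - y_star) ** 2, axis=1))
    return classes[nearest], y_star

# Hypothetical W for D = 3 and K = 3, so W is 3 x 2:
# it keeps the first two coordinates and discards the noisy third one
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X_train = np.array([[0.0, 0.0, 5.0], [0.2, 0.1, -3.0],
                    [4.0, 0.0, 1.0], [4.1, 0.2, 7.0],
                    [0.0, 4.0, -2.0], [0.1, 4.2, 9.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

pred, y_star = classify_projected(W, X_train, labels, np.array([3.8, 0.3, 100.0]))
print(pred, y_star)
```

The new sample lives in $\mathbb{R}^D$ while its projection $\mathbf{y}_{*}$ has $K - 1$ components, and the large third coordinate has no influence on the decision.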