# Lecture 21: Generalised Linear Discriminant Functions

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
import utils as utils
from fractions import Fraction

from IPython.display import display, HTML

# Inline plotting
%matplotlib inline

# Make sympy print pretty math expressions
sy.init_printing()

utils.load_custom_styles()

## Minimum Square Error (MSE)

In the previous lecture, we looked at linear discriminant functions for linearly separable samples. In that case, we can define an objective function based on the set of misclassified samples $\chi$. This approach does not yield good results when the two classes are not linearly separable.

<img src="figures/lecture-21/minimum-square-error-intro.png" width="600" />














---
### What are the target vectors?

<img src="figures/lecture-21/target-vectors.png" width="600" />










---
### How to define the objective function?

<img src="figures/lecture-21/mse-objective-function.png" width="600" />






<div class="sidenote">
<p>Recall that in the linearly separable case, the objective function was defined on the misclassified samples.
This works because, when the samples are linearly separable, the algorithm can reach a solution where the number
of misclassified samples is zero. If the samples are not linearly separable, the algorithm does not converge.
</p>
    
<p>
Now, we define a different objective function based on the least-squares problem. Just by defining the objective
function differently, we can approach the problem from a different direction. So whether a classifier
works on data that is not linearly separable depends on the objective function that we use to train the
classifier.
</p>
    
<p>
If the samples are linearly separable $\mathcal{J}_s$ will be zero, otherwise it will be larger than 0.
</p>
</div>

#### Why do we use the L2 norm in the objective function? 

Because we want a single scalar error value: the norm collapses the vector of errors into one number that expresses the entire objective.

#### Why do we take the square of the norm?
- We get an easier expression when we take the derivative. The expression becomes more complicated with the square root.
- We get a quadratic function, which is convex and has a single minimum value.

---
### How to solve the objective function analytically?

Equation 4.43 looks like the least-squares problem.

<img src="figures/lecture-21/derivative-mse-objective-function.png" width="600" />










Setting the derivative to zero we get:

<img src="figures/lecture-21/mse-derivative-obj-zero.png" width="600" />


The optimal $\mathbf{w}$ can be computed using the above expression!
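As a minimal sketch of the analytical solution, the snippet below assumes that the samples are stored as the columns of $\mathbf{X}$ (augmented with a constant 1 to absorb the bias), that the target vector $\mathbf{b}$ holds $+1$ and $-1$, and that the objective is $\|\mathbf{X}^T\mathbf{w} - \mathbf{b}\|_2^2$. The data, the variable names and these conventions are illustrative assumptions, since the equations above are only available as figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs in 2-D, 50 samples per class (synthetic, illustrative data).
n_per_class = 50
class_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n_per_class, 2))
class_neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(n_per_class, 2))

# Assumed convention: samples are the COLUMNS of X, augmented with a constant 1
# so that the bias w_0 is absorbed into w. X has shape (D + 1, N).
samples = np.vstack([class_pos, class_neg])            # (N, D)
X = np.vstack([samples.T, np.ones(len(samples))])      # (D + 1, N)

# Assumed target vector: +1 for the first class, -1 for the second.
b = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# Closed-form minimiser of ||X^T w - b||^2: solve (X X^T) w = X b.
# np.linalg.solve avoids forming the inverse explicitly.
w = np.linalg.solve(X @ X.T, X @ b)

# The gradient X (X^T w - b) should vanish (up to rounding) at the minimiser.
gradient = X @ (X.T @ w - b)

# Classify by the sign of the decision function.
predictions = np.sign(X.T @ w)
accuracy = np.mean(predictions == b)
print("w =", w, "| max |gradient| =", np.abs(gradient).max(), "| accuracy =", accuracy)
```

Solving the linear system instead of calling `np.linalg.inv` is a common numerical choice; both correspond to the same expression for $\mathbf{w}$.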

---
### When is a matrix invertible?

Equation 4.45 uses the inverse of $\mathbf{X}\mathbf{X}^T$. Only a square matrix can be invertible (non-singular), and it is invertible exactly when it has full rank. For a general matrix, full rank means that its rank equals the lesser of the number of rows and columns.

<div>
Suppose $\mathbf{A}$ is an $N \times D$ matrix:
<ul>
<li>The rank is a nonnegative integer that cannot be greater than either $N$ or $D$: $\operatorname{rank}(\mathbf{A}) \leq \min(N, D)$</li>
<li>The rank is the maximum number of linearly independent columns of $\mathbf{A}$, which equals the maximum number of linearly independent rows.</li>
<li>Only the zero matrix has rank $0$.</li>
<li>If $\mathbf{A}$ is a square matrix $N=D$, then $\mathbf{A}$ is invertible if and only if $\mathbf{A}$ has rank $N$</li>
    <ul>
        <li>A square matrix is full rank if and only if its determinant is nonzero.</li>
    </ul>
<li>If $\mathbf{A}$ is a square matrix $N=D$ and $\mathbf{A}\mathbf{A}^T$ is invertible then $\mathbf{A}$ is also invertible.</li>
<li>For a non-square matrix, it will always be the case that either the rows or columns (whichever is larger in number) are <b>linearly dependent</b></li>

</ul>
</div>
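The rank properties listed above can be checked numerically. The following sketch uses `np.linalg.matrix_rank` and `np.linalg.det` on small hand-picked matrices; the matrices themselves are illustrative and not taken from the lecture.

```python
import numpy as np

# A 4 x 3 matrix whose third column is the sum of the first two,
# so only two columns are linearly independent.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 3.0, 4.0]])
rank_A = np.linalg.matrix_rank(A)

# A square matrix with full rank: nonzero determinant, hence invertible.
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rank_B = np.linalg.matrix_rank(B)
det_B = np.linalg.det(B)

# A square matrix whose second row is twice the first: rank deficient, singular.
C = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_C = np.linalg.matrix_rank(C)
det_C = np.linalg.det(C)

# The zero matrix is the only matrix with rank 0.
rank_zero = np.linalg.matrix_rank(np.zeros((3, 2)))

print(rank_A, rank_B, det_B, rank_C, det_C, rank_zero)
```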

----
### Full rank in Machine Learning

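Suppose we have $N$ samples of dimension $D$, stored as the columns of $\mathbf{X} \in \mathbb{R}^{D\times N}$ so that $\mathbf{XX}^T$ is a $D \times D$ matrix. If the number of samples $N$ is at least the number of dimensions $D$ and the samples are drawn independently from the same (continuous, non-degenerate) distribution, then $\mathbf{XX}^T$ has full rank with probability 1 and can be inverted. The matrix $\mathbf{X}$ itself is not square in general, so it has no ordinary inverse.

In other words, we expect $\mathbf{XX}^T$ to be non-singular when the samples $\mathbf{x}_i, i = 1, 2, \cdots, N$ are Independent and Identically Distributed (i.i.d.) and $N \geq D$.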



<div class="sidenote">
<strong>Independent and Identically Distributed Assumption:</strong> The i.i.d. assumption has two parts:
  <ul>
    <li>
        Independent: two random variables are not dependent on each other. Think about it this way; if we know the outcome of one random variable, does it give us information as to the outcome of another?
    </li>
    <li>Identically Distributed: samples come from the same random variable or function. For example, all our samples $\mathbf{x}_i$ may come from the normal distribution with mean $\mu_1$ and covariance matrix $\Sigma_1$ i.e., $\mathbf{x}_i \sim N(\mu_1, \Sigma_1)$
    </li>
  </ul>
</div>

---
### What is pseudo-inverse?

When a matrix is not invertible i.e., when the rank of a matrix is not full, then we can use the pseudo-inverse.

The matrix $\mathbf{X}^{\dagger} = (\mathbf{XX}^T)^{-1}\mathbf{X}$ is called the pseudo-inverse of $\mathbf{X}^T$.

We can make $\mathbf{XX}^T$ invertible by using a regularised version of $\mathbf{X}^{\dagger}$. This means we change the diagonal elements of the matrix $\mathbf{XX}^T$ by adding a small scaled version of the identity matrix:

<img src="figures/lecture-21/regularised-version-pseudo-inverse.png" width="600" />



Epsilon $\epsilon$ is a very small positive value. Adding it to the diagonal makes the rows and the columns of the resulting matrix linearly independent.

<div class="sidenote">
  Note: If $\mathbf{X}\mathbf{X}^T$ is already invertible, we can set $\epsilon=0$.
</div>

To see why this helps, suppose our matrix $\mathbf{X}$ is $N \times D$ with $N > D$. The product $\mathbf{X}\mathbf{X}^T$ is then $N \times N$, but its rank is at most $D$. The upper-left $D \times D$ submatrix has full rank, while the remaining rows and columns can be represented as linear combinations of the others. Adding a small scaled version of the identity matrix results in an invertible matrix because every eigenvalue is shifted by $\epsilon > 0$, so the rows and columns become linearly independent: no column can be expressed as a linear combination of the remaining columns. The rank of $\mathbf{X}\mathbf{X}^T$ itself, and hence of the pseudo-inverse, is still at most $D$. See the figure below.


<img src="figures/lecture-21/why-epsilon.png" width="400" />
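The effect of adding $\epsilon \mathbf{I}$ can be illustrated numerically. This is a small sketch on a random $N \times D$ matrix with $N > D$; the sizes, the random data and the value of `epsilon` are illustrative assumptions rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generic matrix with more rows than columns (N = 5, D = 2).
N, D = 5, 2
X = rng.normal(size=(N, D))

# X X^T is N x N but its rank is at most D, so it is singular.
gram = X @ X.T
rank_gram = np.linalg.matrix_rank(gram)

# Add a small scaled identity matrix (epsilon chosen only for illustration).
epsilon = 1e-3
gram_reg = gram + epsilon * np.eye(N)
rank_reg = np.linalg.matrix_rank(gram_reg)

# Every eigenvalue is shifted by epsilon, so none of them is zero any more.
eig_gram = np.linalg.eigvalsh(gram)
eig_reg = np.linalg.eigvalsh(gram_reg)

# Regularised version of the pseudo-inverse expression: (X X^T + eps I)^{-1} X.
X_dagger_reg = np.linalg.solve(gram_reg, X)

print("rank before:", rank_gram, "| rank after:", rank_reg)
print("smallest eigenvalue before/after:", eig_gram.min(), eig_reg.min())
```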






<div class="sidenote">
    Matlab has a function called <code>pinv(A)</code> that computes the Moore-Penrose pseudo-inverse. Rather than searching over values of epsilon, it works as follows:
    <ul>
        <li>Compute the singular value decomposition of A.</li>
        <li>Treat singular values below a small tolerance as zero and invert the remaining ones.</li>
        <li>If A is invertible, the result equals the ordinary inverse up to rounding error. The result can also be seen as the limit of the regularised inverse as epsilon tends to zero.</li>
    </ul>
</div>
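NumPy offers a comparable function, `np.linalg.pinv`, which is based on the singular value decomposition. The sketch below assumes samples stored as the columns of $\mathbf{X}$ and compares it with the explicit expression $(\mathbf{XX}^T)^{-1}\mathbf{X}$, then applies it to a rank-deficient matrix where the explicit expression is not available. The data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed convention: one sample per column, shape (D, N) with N > D.
D, N = 3, 8
X = rng.normal(size=(D, N))

# Explicit expression from the text: (X X^T)^{-1} X, the pseudo-inverse of X^T.
X_dagger_formula = np.linalg.solve(X @ X.T, X)

# SVD-based Moore-Penrose pseudo-inverse of X^T.
X_dagger_pinv = np.linalg.pinv(X.T)
formulas_agree = np.allclose(X_dagger_formula, X_dagger_pinv)

# Rank-deficient case: repeat the first row, so that X X^T becomes singular.
X_deficient = np.vstack([X, X[0]])                 # shape (D + 1, N), rank D
A = X_deficient.T
A_pinv = np.linalg.pinv(A)

# One of the defining Moore-Penrose conditions: A A^+ A = A.
penrose_holds = np.allclose(A @ A_pinv @ A, A)

print("formula matches pinv:", formulas_agree, "| A A^+ A == A:", penrose_holds)
```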

---
### How to solve the objective function iteratively?

Another way to optimize the objective function in Eq. 4.43 is to apply an iterative optimization algorithm.
The analytical approach discussed above and the iterative approach lead to almost the same $\mathbf{w}$
because the objective function is quadratic and therefore has only one optimum.

<img src="figures/lecture-21/iterative-lms-algorithm.png" width="600" />








<img src="figures/lecture-21/iterative-sample-based-lms-algorithm.png" width="600" />
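The two iterative schemes can be sketched as follows. Since the algorithms above are only available as figures, the update rules here are assumptions: a batch gradient step $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \mathbf{X}(\mathbf{X}^T\mathbf{w} - \mathbf{b})$ and a sample-based (Widrow-Hoff style) step $\mathbf{w} \leftarrow \mathbf{w} + \eta\,(b_i - \mathbf{w}^T\mathbf{x}_i)\,\mathbf{x}_i$. The data, step sizes and iteration counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic two-class data; samples are the columns of X, augmented with a 1.
n_per_class = 40
samples = np.vstack([
    rng.normal(loc=[1.5, 1.5], scale=1.0, size=(n_per_class, 2)),
    rng.normal(loc=[-1.5, -1.5], scale=1.0, size=(n_per_class, 2)),
])
X = np.vstack([samples.T, np.ones(len(samples))])     # (D + 1, N)
b = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

def objective(w):
    """Squared error ||X^T w - b||^2."""
    residual = X.T @ w - b
    return residual @ residual

# Reference: closed-form solution.
gram = X @ X.T
w_closed = np.linalg.solve(gram, X @ b)

# Batch gradient descent. A step of 1 / lambda_max(X X^T) keeps the iteration stable.
step = 1.0 / np.linalg.eigvalsh(gram).max()
w_batch = np.zeros(X.shape[0])
for _ in range(500):
    w_batch -= step * X @ (X.T @ w_batch - b)

# Sample-based updates: one sample at a time, in random order each epoch.
eta = 0.01
w_lms = np.zeros(X.shape[0])
for _ in range(50):
    for i in rng.permutation(X.shape[1]):
        x_i = X[:, i]
        w_lms += eta * (b[i] - w_lms @ x_i) * x_i

lms_accuracy = np.mean(np.sign(X.T @ w_lms) == b)
print("closed form :", w_closed, objective(w_closed))
print("batch       :", w_batch, objective(w_batch))
print("sample-based:", w_lms, objective(w_lms))
```

With a constant step size the sample-based variant keeps fluctuating around the optimum, which is why it only approximately matches the closed-form solution.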









---
## Non-Linear Decision Functions

Until now, we have used a linear decision function (i.e. a hyperplane in $\mathbb{R}^D$) separating the feature space into two regions:

$$
g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0
$$

However, most classification problems cannot be solved linearly. How can we extend what we have learnt until now to define nonlinear decision functions?

This can be done in two ways:
1. Augment the decision function i.e., change the form of the decision function
2. Augment the data representation

---
### How to augment the decision function?

We could take the linear decision function and augment it by adding more terms involving the products of pairs of the elements of $\mathbf{x}$:

<img src="figures/lecture-21/non-linear-decision-function.png" width="600" />














$\mathbf{Q}$ is a symmetric matrix i.e., $Q_{dl} = Q_{ld}$ because $x_d x_l = x_l x_d$.

Since $\mathbf{Q}$ is a parameter of the classifier, we can define it to be a symmetric and nonsingular matrix. Why?
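A quadratic decision function of this kind can be sketched in a few lines. The form $g(\mathbf{x}) = \mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{w}^T\mathbf{x} + w_0$ is assumed here, since the equation above is only available as a figure, and the helper name `quadratic_decision` is hypothetical. The example also shows that only the symmetric part of $\mathbf{Q}$ influences the value of $g$.

```python
import numpy as np

def quadratic_decision(x, Q, w, w0):
    """Evaluate the assumed form g(x) = x^T Q x + w^T x + w0 for one sample x."""
    return x @ Q @ x + w @ x + w0

# Unit circle as decision boundary: Q = I, w = 0, w0 = -1.
Q = np.eye(2)
w = np.zeros(2)
w0 = -1.0

inside = quadratic_decision(np.array([0.2, 0.3]), Q, w, w0)       # negative
on_boundary = quadratic_decision(np.array([1.0, 0.0]), Q, w, w0)  # zero
outside = quadratic_decision(np.array([2.0, 0.0]), Q, w, w0)      # positive

# Because x_d * x_l == x_l * x_d, replacing Q by its symmetric part
# (Q + Q^T) / 2 leaves g unchanged.
Q_asym = np.array([[1.0, 3.0],
                   [-1.0, 2.0]])
Q_sym = 0.5 * (Q_asym + Q_asym.T)
x = np.array([0.5, -1.5])
g_asym = quadratic_decision(x, Q_asym, w, w0)
g_sym = quadratic_decision(x, Q_sym, w, w0)

print(inside, on_boundary, outside, g_asym, g_sym)
```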

**What is the decision function shape in this case?**

The properties of the classifier are then defined by the properties of the matrix:

<img src="figures/lecture-21/tilde-weight.png" width="200" />














- If $\tilde{\mathbf{W}}$ is a positive scaled version of the identity matrix, the decision function corresponds to a hypersphere in $\mathbb{R}^D$

<img src="figures/lecture-21/hypersphere.png" width="250" />












- If $\tilde{\mathbf{W}}$ is positive definite, the decision function corresponds to a hyperellipsoid in $\mathbb{R}^D$

<img src="figures/lecture-21/hyper-ellipsoid-function-3d.png" width="500" />














- If $\tilde{\mathbf{W}}$ is indefinite (i.e. some of its eigenvalues are positive and others negative), the decision function corresponds to a hyperboloid in $\mathbb{R}^D$

<img src="figures/lecture-21/hyperboloid.png" width="300" />














<div class="sidenote">
This is not used in practice. It only shows that it is possible to construct a non-linear decision function this way.
</div>

---
### How to augment the data representation?

A second way to obtain a nonlinear decision function is to define a nonlinear function $\phi : \mathbb{R}^D \to \mathbb{R}^{M}$ that maps the original data representation $\mathbf{x}$ in $\mathbb{R}^D$ to some other feature space $\mathbb{R}^M$.

Some examples include:

- Example of a mapping from one dimension to two dimensions: $\phi(x) = [x, x^2]^T$
- Example of a mapping from two dimensions to three dimensions: $\phi(x_1, x_2) = [x_1, x_2, x_1 x_2]^T$
- Example of a mapping from two dimensions to four dimensions: $\phi(x_1, x_2) = [x_1, x_2, x_1 x_2, \sin(x_1 x_2)]^T$

Defining a linear decision function using the mapping $\phi$:
$$
g(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x})
$$
corresponds to a nonlinear decision function on $\mathbf{x}$.
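As a small illustration of this idea, the sketch below takes one-dimensional data that no single threshold can separate and maps it with $\phi(x) = [x, x^2]^T$. A constant 1 is appended to absorb the bias, which is an assumption added here, and the weights are fitted with the least-squares solution. The data points are hand-picked for illustration.

```python
import numpy as np

# 1-D samples: the class is +1 when |x| > 1 and -1 otherwise,
# so no single threshold on x separates the two classes.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
labels = np.where(np.abs(x) > 1.0, 1.0, -1.0)

def phi(values):
    """Map each scalar x to [x, x^2, 1]; the constant 1 absorbs the bias (assumption)."""
    return np.stack([values, values**2, np.ones_like(values)])   # shape (3, N)

def fit_least_squares(features, targets):
    """Solve (F F^T) w = F t for features stored as columns."""
    return np.linalg.solve(features @ features.T, features @ targets)

# Linear decision function in the mapped space.
Phi = phi(x)
w = fit_least_squares(Phi, labels)
mapped_accuracy = np.mean(np.sign(w @ Phi) == labels)

# Baseline: linear decision function on the original representation [x, 1].
X_linear = np.stack([x, np.ones_like(x)])
w_linear = fit_least_squares(X_linear, labels)
linear_accuracy = np.mean(np.sign(w_linear @ X_linear) == labels)

print("accuracy with phi:", mapped_accuracy, "| accuracy without phi:", linear_accuracy)
```

In the mapped space the decision boundary is a line, while in terms of the original $x$ it becomes a pair of thresholds on $|x|$.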

---
## Binary Classifiers in Multi-Class Classification Problems

Suppose we have a training set $\mathbf{T} = \{ (\mathbf{x}_1, l_1), (\mathbf{x}_2, l_2), \cdots,  (\mathbf{x}_N, l_N) \}$ where each label $l_i \in \{ c_1, c_2, \cdots, c_K \}$ takes one of $K$ different classes.

We can use multiple binary classifiers to solve multi-class classification problems.

Binary decision functions can be combined in order to define multi-class classification schemes in several ways:
- One-versus-One Classification aka. All-Pairs
- One-versus-Rest Classification aka. One-versus-All
- Error Correcting Output Codes (ECOC)

---
### One-versus-One Classification Scheme

In the One-vs-One classification scheme, we use $K(K-1)/2$ classifiers, one for each pair of classes. For example, in a 3-class classification problem involving classes $c_1$, $c_2$ and $c_3$, we define the following binary decision functions:

$$
\begin{aligned}
& g_{12}(\mathbf{x}) \text{ discriminates classes } c_1 \text{ and } c_2 \\
& g_{13}(\mathbf{x}) \text{ discriminates classes } c_1 \text{ and } c_3 \\
& g_{23}(\mathbf{x}) \text{ discriminates classes } c_2 \text{ and } c_3
\end{aligned}
$$

---
#### How to define the decision functions?

To define the decision function $g_{kl}(\cdot)$, a subset of the original training set is used, which is formed only by the training vectors belonging to classes $c_k$ and $c_l$.

---
#### How to perform the classification?

Classification is based on **majority voting**. Each binary classifier votes for one of its two classes, and the new sample is assigned to the class that receives the most votes.
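The One-vs-One scheme can be sketched as follows. The binary classifiers here are least-squares linear classifiers fitted with `np.linalg.lstsq`, which is just one possible choice. The data, the helper names (`augment`, `fit_mse`, `predict_ovo`) and the convention of storing samples as rows are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

# Three synthetic, well-separated classes in 2-D (samples stored as rows here).
centres = np.array([[0.0, 4.0], [-4.0, -2.0], [4.0, -2.0]])
n_per_class = 30
samples = np.vstack([rng.normal(loc=c, scale=0.8, size=(n_per_class, 2)) for c in centres])
labels = np.repeat(np.arange(3), n_per_class)
n_classes = len(centres)

def augment(points):
    """Append a constant 1 to every sample to absorb the bias."""
    return np.hstack([points, np.ones((len(points), 1))])

def fit_mse(points, targets):
    """Least-squares weights for targets in {+1, -1}."""
    w, *_ = np.linalg.lstsq(augment(points), targets, rcond=None)
    return w

# Train g_kl on the samples of classes k and l only: K (K - 1) / 2 classifiers.
classifiers = {}
for k, l in combinations(range(n_classes), 2):
    mask = (labels == k) | (labels == l)
    targets = np.where(labels[mask] == k, 1.0, -1.0)
    classifiers[(k, l)] = fit_mse(samples[mask], targets)

def predict_ovo(points):
    """Each g_kl votes for class k (response >= 0) or class l; the majority wins."""
    votes = np.zeros((len(points), n_classes), dtype=int)
    A = augment(points)
    for (k, l), w in classifiers.items():
        response = A @ w
        votes[response >= 0, k] += 1
        votes[response < 0, l] += 1
    return votes.argmax(axis=1)

accuracy = np.mean(predict_ovo(samples) == labels)
print("number of classifiers:", len(classifiers), "| training accuracy:", accuracy)
```

Ties between classes are broken here by `argmax`, which picks the lowest class index; other tie-breaking rules are possible.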

---
#### What are the Advantages and Disadvantages?

- Advantage: The training set for each classifier is smaller, so each classifier trains faster.
- Advantage: It can be more accurate than one-vs-all because the decision relies on several classifiers per class rather than a single one.
- Disadvantage: For large $K$, we have to train and evaluate many more binary classifiers, on the order of $K^2$. This becomes expensive during classification.

---
### One-versus-Rest Classification Scheme

Train a classifier for each class so we end up having $K$ different binary classifiers. For example, in a 3-class classification problem involving classes $c_1$, $c_2$ and $c_3$, we define the following binary decision functions:

$$
\begin{aligned}
& g_{1}(\mathbf{x}) \text{ discriminates class } c_1 \text{ from } c_2 \text{ and } c_3 \\
& g_{2}(\mathbf{x}) \text{ discriminates class } c_2 \text{ from } c_1 \text{ and } c_3 \\
& g_{3}(\mathbf{x}) \text{ discriminates class } c_3 \text{ from } c_1 \text{ and } c_2
\end{aligned}
$$

---
#### How to define the decision functions?

To define the decision function $g_{k}(\cdot)$, the entire training set is used where the samples belonging to class $c_k$ form the positive class, while the samples belonging to the other classes form the negative class (a new class).

---
#### How to perform the classification?

When we want to classify a new sample $\mathbf{x}_{*}$, that sample is given to all the binary classifiers. 
- Compute the response of each decision function given the new sample. Each classifier gives a response (e.g. the distance from the decision hyperplane).
- Classify based on the highest response: The new sample $\mathbf{x}_{*}$ is assigned the class $c_k$ corresponding to the decision function with the highest response.
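The One-vs-Rest procedure can be sketched in the same style. The binary classifiers are again least-squares linear classifiers, and the raw value $g_k(\mathbf{x})$ is used as the response. These choices, the synthetic data and the helper names (`augment`, `fit_mse`, `predict_ovr`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Three synthetic, well-separated classes in 2-D (samples stored as rows here).
centres = np.array([[0.0, 4.0], [-4.0, -2.0], [4.0, -2.0]])
n_per_class = 30
samples = np.vstack([rng.normal(loc=c, scale=0.8, size=(n_per_class, 2)) for c in centres])
labels = np.repeat(np.arange(3), n_per_class)
n_classes = len(centres)

def augment(points):
    """Append a constant 1 to every sample to absorb the bias."""
    return np.hstack([points, np.ones((len(points), 1))])

def fit_mse(points, targets):
    """Least-squares weights for targets in {+1, -1}."""
    w, *_ = np.linalg.lstsq(augment(points), targets, rcond=None)
    return w

# Train g_k on the ENTIRE training set: class k is positive, all others negative.
weights = np.stack([
    fit_mse(samples, np.where(labels == k, 1.0, -1.0)) for k in range(n_classes)
])                                                     # shape (K, D + 1)

def predict_ovr(points):
    """Evaluate all K decision functions and pick the one with the highest response."""
    responses = augment(points) @ weights.T            # shape (N, K)
    return responses.argmax(axis=1)

# Classify a new sample x_* and measure the training accuracy.
x_new = np.array([[3.5, -1.5]])
predicted_class = predict_ovr(x_new)[0]
accuracy = np.mean(predict_ovr(samples) == labels)
print("predicted class for x_*:", predicted_class, "| training accuracy:", accuracy)
```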

---
#### What are the Advantages and Disadvantages?

- Advantage: Better classification time for large $K$ because we only have $K$ binary classifiers to evaluate.
- Disadvantage: If one classifier is wrong (e.g. by giving the highest response), then the classification may be wrong

---
### Error Correcting Output Codes

Error Correcting Output Codes (ECOC) is a framework that treats the base learners as noisy channels and uses error-correcting codes (ECC) to correct the prediction errors made by the learners.

---
#### How to train the classifiers?

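Each class is assigned a binary codeword, and the codewords form the rows of a coding matrix. For each bit (i.e. each column of the coding matrix), we train one binary classifier.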

---
#### How to classify?

Pick the closest row of the coding matrix (e.g. in Hamming distance to the predicted bits) as the prediction.
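The training targets and the decoding step can be sketched as below. The coding matrix is a hand-picked 7-bit code for 4 classes in which every pair of codewords differs in 4 bits, so a single wrong bit can be corrected. It uses more than the minimum number of bits on purpose, to provide redundancy. The matrix, the Hamming-distance decoding and the helper name `decode` are illustrative assumptions, and the binary classifiers are replaced by a simulated output.

```python
import numpy as np

# One row (codeword) per class, one column (bit) per binary classifier.
coding_matrix = np.array([
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
])

# Training targets: the classifier for bit j is trained on column j of this array.
labels = np.array([0, 3, 2, 1, 2])
bit_targets = coding_matrix[labels]                # shape (n_samples, n_bits)

def decode(predicted_bits):
    """Return the class whose codeword is closest in Hamming distance."""
    distances = np.sum(coding_matrix != predicted_bits, axis=1)
    return int(np.argmin(distances)), distances

# Simulate the output of the 7 classifiers for a class-2 sample
# where the first classifier makes a mistake.
noisy_bits = coding_matrix[2].copy()
noisy_bits[0] ^= 1
decoded_class, distances = decode(noisy_bits)

print("Hamming distances:", distances, "| decoded class:", decoded_class)
```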

---
#### What are the Advantages and Disadvantages?

- Advantage: Much faster if $K$ is large because the number of bits (and hence classifiers) can be as small as logarithmic in the number of classes
- Disadvantage: Underlying classifiers may perform poorly