# Classification with Kernel Fisher Discriminant

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Assignment 1 (due: Monday, 20 April, 23:59)

Name:

Student ID:

## Instructions

|             |Notes|
|:------------|:--|
|Maximum marks| 20|
|Weight|20% of final grade|
|Format| Complete this IPython notebook. Do not forget to fill in your name and student ID above.|
|Submission mode| Use [wattle](https://wattle.anu.edu.au/)|
|Formulas| All formulas which you derive need to be explained unless you use very common mathematical facts. Picture yourself explaining your arguments to somebody who is just learning about your assignment. In other words, do not assume that the person marking your assignment knows all the background and that you can therefore write down the formulas without any explanation. It is your task to convince the reader that you know what you are doing when you derive an argument.|
| Code quality | Python code should be well structured, use meaningful identifiers for variables and subroutines, and provide sufficient comments. Please refer to the examples given in the tutorials. |
| Code efficiency | An efficient implementation of an algorithm uses fast subroutines provided by the language or additional libraries. For the purpose of implementing Machine Learning algorithms in this course, that means using the appropriate data structures provided by Python and in numpy/scipy (e.g. Linear Algebra and random generators). |
| Late penalty | For every day (starting at midnight) after the deadline of an assignment, the mark will be reduced by 5%. No assignment will be accepted more than 10 days after the deadline. |
| Cooperation | All assignments must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (please see the ANU policies on [Academic Honesty and Plagiarism](http://academichonesty.anu.edu.au)). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to broadly discuss ideas, approaches and techniques with a few other students, but not at a level of detail where specific solutions or implementation issues are described by anyone. If you choose to consult with other students, you must include the names of your discussion partners for each solution. If you have any questions on this, please ask the lecturer before you act. |
| Solution | To be presented in the tutorials. |

$\newcommand{\dotprod}[2]{\left\langle #1, #2 \right\rangle}$
$\newcommand{\onevec}{\mathbb{1}}$

Setting up the environment

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## The data set

We will use census data from the [Australian Bureau of Statistics](http://www.abs.gov.au/). The table, which you can [download here](https://sml.forge.nicta.com.au/isml15/data/census_abs2011_summary.csv), contains a set of median and mean values for different regions of Australia.
*(optional information, for interest only) The data consists of values from 2011 for Statistical Areas Level 3 (SA3s), which are often functional areas of regional cities and large urban transport and service hubs. In general they have populations between 30,000 and 130,000 persons. For details, look at the [source at ABS](http://www.abs.gov.au/websitedbs/censushome.nsf/home/data?opendocument&navpos=200).*

The following code reads in the data using pandas.

In [None]:
raw_data = pd.read_csv('census_abs2011_summary.csv')
print(raw_data.shape)
raw_data.head()

## (2 points) Plot some characteristics of the data

Display the following two summaries of the data. Please label all the axes, and title the plots appropriately.
* Plot the median weekly rent (vertical axis) against the median monthly mortgage repayment (horizontal axis)
* Plot the histogram of the median age. Use bin boundaries [0,20,30,35,40,45,50,60,80]
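
The bin boundaries above have unequal widths, so it can help to inspect the counts per bin before plotting. The following is only a sketch on made-up ages (`ages_demo` is a hypothetical array, not taken from the census table).

```python
import numpy as np

# Made-up ages for illustration only; they are not taken from the census table.
ages_demo = np.array([18, 25, 33, 36, 38, 41, 44, 47, 52, 65])
bin_edges = [0, 20, 30, 35, 40, 45, 50, 60, 80]

# Each bin is half-open [left, right), except the last, which also includes 80.
counts, edges = np.histogram(ages_demo, bins=bin_edges)
bin_widths = np.diff(edges)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('[%2d, %2d): %d' % (left, right, count))
```

`plt.hist` accepts the same sequence of edges through its `bins` argument, so the list can be passed on unchanged when drawing the figure.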

In [None]:
# Solution goes here

The binary classification task under consideration is to predict, from the other features, whether the median age in a region is 38 or older.

In [None]:
labelvec = np.array(raw_data['Median_age_of_persons_Census_year_2011'])
y = np.ones(len(labelvec))
neg = labelvec < 38
y[neg] = -1
num_pos = len(np.flatnonzero(y > 0))
num_neg = len(np.flatnonzero(y < 0))
print('Number of positive/negative examples = %d/%d' % (num_pos, num_neg))

headers = list(raw_data.columns.values)
headers.remove('Median_age_of_persons_Census_year_2011')
raw_feat = np.array(raw_data[headers])
avg = np.mean(raw_feat, axis=0)
std_dev = np.std(raw_feat, axis=0)
X = (raw_feat-avg)/std_dev
X.shape

## (3 points) Classification via Fisher's Linear Discriminant

Consider the problem of binary classification. Fisher's criterion is given by:
$$
J(w) = \frac{w^T S_B w}{w^T S_W w}
$$
where $S_B$ is the between class covariance and $S_W$ is the within class covariance.
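
As a reminder of how the criterion behaves numerically, here is a minimal sketch on synthetic data. It assumes labels in $\{-1, +1\}$, examples stored as rows, and the standard result that $J(w)$ is maximised by $w \propto S_W^{-1}(m_+ - m_-)$. The names `class_statistics`, `fisher_criterion` and `fisher_direction` are hypothetical helpers, not the required `train_fld` interface, and no threshold is chosen here.

```python
import numpy as np

def class_statistics(X, y):
    """Return the mean difference and within-class scatter for labels in {-1, +1}."""
    X_pos, X_neg = X[y > 0], X[y < 0]
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_W = ((X_pos - m_pos).T @ (X_pos - m_pos)
           + (X_neg - m_neg).T @ (X_neg - m_neg))
    return m_pos - m_neg, S_W

def fisher_criterion(w, X, y):
    """Evaluate J(w) = (w^T S_B w) / (w^T S_W w) with S_B = d d^T, d = m_pos - m_neg."""
    mean_diff, S_W = class_statistics(X, y)
    return (w @ mean_diff) ** 2 / (w @ S_W @ w)

def fisher_direction(X, y):
    """Return the unit vector proportional to S_W^{-1} (m_pos - m_neg)."""
    mean_diff, S_W = class_statistics(X, y)
    w = np.linalg.solve(S_W, mean_diff)
    return w / np.linalg.norm(w)

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
                    rng.normal(-1.0, 1.0, size=(50, 2))])
y_demo = np.concatenate([np.ones(50), -np.ones(50)])

w_demo = fisher_direction(X_demo, y_demo)
projections = X_demo @ w_demo
print('J at the Fisher direction:', fisher_criterion(w_demo, X_demo, y_demo))
print('J along the first axis:   ', fisher_criterion(np.array([1.0, 0.0]), X_demo, y_demo))
```

Using `np.linalg.solve` avoids forming the inverse of $S_W$ explicitly. The projections of the two classes still overlap in general, which is why a threshold has to be chosen.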

When implementing this as a linear classifier, you need to choose a threshold in projection space. Describe your criteria for choosing a threshold, and justify why it is a good one.

### Solution description


Implement two functions ```train_fld``` and ```predict_fld``` that correspond to training and prediction. The functions should be used as in the following cell.

In [None]:
# Solution goes here

In [None]:
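def confusionMatrix(labels_test, labels_predicted):
    """Compute the matrix of predictions versus labels.

    Labels must be nonzero: positive values denote the positive class,
    negative values the negative class. Returns (TP, TN, FP, FN).
    """
    if len(labels_test) != len(labels_predicted):
        raise ValueError('labels_test and labels_predicted differ in length')
    TP = 0; FP = 0; TN = 0; FN = 0
    for i in range(0, len(labels_test)):
        # A zero label or prediction belongs to neither class.
        if labels_test[i] == 0 or labels_predicted[i] == 0:
            raise ValueError('labels and predictions must be nonzero')
        if labels_test[i] > 0:
            if labels_predicted[i] > 0: TP += 1
            else: FN += 1
        else:
            if labels_predicted[i] > 0: FP += 1
            else: TN += 1
    return (TP, TN, FP, FN)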

def accuracy(output, labels_test):
    """How many correct predictions?"""
    TP, TN, FP, FN = confusionMatrix(labels_test, np.sign(output))
    return float(TP + TN) / (TP + TN + FP + FN)


In [None]:
w,c = train_fld(X,y)
pred = predict_fld(w,c,X)
print(accuracy(pred[0,:],y))

## (4 points) Kernels

Recall and write down the definitions corresponding to the Gaussian kernel and the inhomogeneous polynomial kernel. Please be precise about the meaning of each symbol *(1 point)*.

### Solution

Implement two functions corresponding to the kernels above *(2 points)*.

In [None]:
# Solution goes here

Compute the kernel matrix on all the examples above, and plot the eigenvalues of the kernel matrix. Show results for the Gaussian kernel with width $\sigma=1.1$ and the polynomial kernel with $c=1$ of order 2 and 3 *(1 point)*.
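
The sketch below shows one way such a kernel matrix and its spectrum could be computed on synthetic data. It assumes the common parameterisations $k(x,y) = \exp(-\|x-y\|^2 / (2\sigma^2))$ and $k(x,y) = (\dotprod{x}{y} + c)^d$; check these against the definitions from the lecture before relying on them. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)) for the rows of A and B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def polynomial_kernel_matrix(A, B, c, degree):
    """K[i, j] = (<a_i, b_j> + c)^degree for the rows of A and B."""
    return (A @ B.T + c) ** degree

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))

K_gauss = gaussian_kernel_matrix(X_demo, X_demo, sigma=1.1)
K_poly2 = polynomial_kernel_matrix(X_demo, X_demo, c=1.0, degree=2)

# eigvalsh exploits symmetry and returns real eigenvalues in ascending order.
eig_gauss = np.linalg.eigvalsh(K_gauss)[::-1]
eig_poly2 = np.linalg.eigvalsh(K_poly2)[::-1]
print(eig_gauss[:5])
print(eig_poly2[:5])
```

Both matrices are built from matrix products and broadcasting rather than Python loops, in line with the code efficiency note in the instructions. A valid kernel matrix is symmetric positive semidefinite, so any negative eigenvalues should be at the level of round-off.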

## (4 points) Normalisation in using kernels

You have seen the importance of normalisation in the tutorials.

In the following, we use the fact that kernels ($k(\cdot, \cdot)$) are inner products in a feature space with feature mapping $\phi(\cdot)$:
$$k(x,y) = \dotprod{\phi(x)}{\phi(y)}$$

### Centering

Centering causes the mean of the data set to be the zero vector in feature space. The following is a derivation for doing the centering directly using kernels.

$$
\mu = \frac{1}{n}\sum_{i=1}^n \phi(x_i)
$$
then
$$
\hat{\phi}(x) = \phi(x) - \mu.
$$
Hence
\begin{align*}
\hat{k}(x,y) &= \dotprod{\hat{\phi}(x)}{\hat{\phi}(y)}\\
    &= \dotprod{\phi(x) - \mu}{\phi(y) - \mu}
\end{align*}

Justify and explain the above steps.
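
A numerical check can accompany the written justification. The sketch below assumes the standard matrix form of centering, $\hat{K} = HKH$ with $H = I - \frac{1}{n}\onevec\onevec^T$, and verifies it for the linear kernel, where the feature map is known explicitly. The name `center_kernel_matrix` is hypothetical.

```python
import numpy as np

def center_kernel_matrix(K):
    """Return H K H with H = I - (1/n) 1 1^T, i.e. the kernel of mean-centred features."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(15, 4))

# For the linear kernel phi(x) = x, so centering can be done on the data directly.
K_linear = X_demo @ X_demo.T
K_centered = center_kernel_matrix(K_linear)
X_centered = X_demo - X_demo.mean(axis=0)
print(np.allclose(K_centered, X_centered @ X_centered.T))
```

The check only works because the feature map of the linear kernel is explicit. For other kernels the matrix form is the only practical route, since $\mu$ is never computed.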

### Solution description


### Unit diagonal

It is often convenient to have all the examples represented by vectors of the same length. This implies that the diagonal of the kernel matrix (the squared length) is the same for all examples. We arbitrarily (without loss of generality) set this length to 1.
\begin{align*}
\hat{k}(x,y) &= \dotprod{\frac{\phi(x)}{\|\phi(x)\|}}{\frac{\phi(y)}{\|\phi(y)\|}}\\
    &= \frac{1}{\|\phi(x)\|\|\phi(y)\|}\dotprod{\phi(x)}{\phi(y)}\\
    &= \frac{1}{\|\phi(x)\|\|\phi(y)\|} k(x,y)
\end{align*}

Normalising the kernel matrix such that it has ones along the diagonal is sometimes called trace normalisation or spherical normalisation.

Justify and explain each step of the derivation above.
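
In matrix terms the last line of the derivation divides entry $(i,j)$ of the kernel matrix by $\sqrt{K_{ii} K_{jj}}$, since $\|\phi(x)\|^2 = k(x,x)$. A minimal sketch on synthetic data, with the hypothetical name `normalise_unit_diagonal` and assuming a strictly positive diagonal:

```python
import numpy as np

def normalise_unit_diagonal(K):
    """Divide K[i, j] by sqrt(K[i, i] * K[j, j]); assumes a strictly positive diagonal."""
    norms = np.sqrt(np.diag(K))
    return K / np.outer(norms, norms)

# Synthetic data and a degree-2 polynomial kernel with c = 1, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))
K_poly = (X_demo @ X_demo.T + 1.0) ** 2

K_unit = normalise_unit_diagonal(K_poly)
print(np.diag(K_unit))
```

After normalisation every entry is the cosine of the angle between two feature vectors, so all values lie in $[-1, 1]$.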


### Solution description


## (3 points) Kernelising Fisher's Discriminant

### Definition
Consider a binary classification task.
Recall from the lecture that Fisher's criterion is given by:
$$
J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}
$$
where $m_1$ and $m_2$ are the means of class $\mathcal{C}_1$ and $\mathcal{C}_2$ respectively, and $s_1^2$ and $s_2^2$ are the corresponding within class variances.

Write down the definitions of $m_1, m_2, s_1, s_2$ in terms of the examples $\mathbf{x}$ and the labels $y$. Please define all your symbols carefully, in particular $\mathbf{w}$.

### Solution

### Matrix form

Observe that you can express the sum of a set of vectors as a product of a vector of ones, $\onevec$, and a matrix containing the vectors. Define matrices $\mathbf{X}_1$ and $\mathbf{X}_2$ as the data corresponding to class $\mathcal{C}_1$ and $\mathcal{C}_2$ respectively. Please specify the dimensions carefully. Using these definitions, derive the expression for the numerator of $J(w)$, i.e. $(m_2 - m_1)^2$, where the data only appears in terms of the matrices $\mathbf{X}_1\mathbf{X}_1^T$, $\mathbf{X}_1\mathbf{X}_2^T$, $\mathbf{X}_2\mathbf{X}_1^T$ and $\mathbf{X}_2\mathbf{X}_2^T$. There could be other vectors, for example $\mathbf{w}$.
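
The first observation can be checked on a toy matrix. The sketch assumes that examples are stored as rows (so `X1_demo` has shape $n_1 \times d$), which is consistent with $\mathbf{X}_1\mathbf{X}_1^T$ being a matrix of inner products between examples. The values are made up.

```python
import numpy as np

# Three made-up examples with two features each, stored as rows.
X1_demo = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
ones = np.ones(X1_demo.shape[0])

# Sum and mean of the examples written as products with the vector of ones.
sum_vec = X1_demo.T @ ones
mean_vec = sum_vec / X1_demo.shape[0]

# The squared length of the sum needs the data only through X1 X1^T.
sq_length_via_gram = ones @ (X1_demo @ X1_demo.T) @ ones
print(sum_vec, mean_vec, sq_length_via_gram)
```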

### Solution

### Kernel form

The above matrix forms $\mathbf{X}_1\mathbf{X}_1^T$, $\mathbf{X}_1\mathbf{X}_2^T$, $\mathbf{X}_2\mathbf{X}_1^T$ and $\mathbf{X}_2\mathbf{X}_2^T$ can be considered to be the special case of the linear kernel, i.e. $\phi(x) = x$. Observe that the numerator of Fisher's criterion can be expressed purely as inner products between examples. It turns out that the denominator can also be expressed purely as inner products.


Implement Kernel Fisher's Discriminant such that it can take a general kernel function. The criterion to be maximised has the form:
$$
J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}
$$
where $M$ and $N$ are defined on the following [Wikipedia entry](http://en.wikipedia.org/wiki/Kernel_Fisher_discriminant_analysis).
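
For orientation, the sketch below builds $N$ and the class mean vectors $M_j$ from a precomputed kernel matrix, assuming the definitions $(M_j)_i = \frac{1}{l_j}\sum_k k(x_i, x_k^j)$ and $N = \sum_j K_j (I - \onevec_{l_j}) K_j^T$, where $\onevec_{l_j}$ is the matrix with all entries $1/l_j$, and the solution $\alpha \propto N^{-1}(M_2 - M_1)$. The ridge term `reg` added to $N$ for numerical stability and the name `kfd_alpha` are assumptions of this sketch, not part of the assignment interface.

```python
import numpy as np

def kfd_alpha(K, y, reg=1e-3):
    """Return alpha = (N + reg * I)^{-1} (M_pos - M_neg) for labels in {-1, +1}.

    K is the (n, n) kernel matrix on the training examples.
    reg is a small ridge term (an assumption of this sketch) that keeps N invertible.
    """
    n = K.shape[0]
    N = np.zeros((n, n))
    class_means = {}
    for label in (-1, 1):
        idx = np.flatnonzero(y == label)
        l_j = len(idx)
        K_j = K[:, idx]                          # kernel columns of class j, shape (n, l_j)
        class_means[label] = K_j.mean(axis=1)    # (M_j)_i = mean over k of k(x_i, x_k^j)
        centering = np.eye(l_j) - np.ones((l_j, l_j)) / l_j
        N += K_j @ centering @ K_j.T
    mean_diff = class_means[1] - class_means[-1]
    return np.linalg.solve(N + reg * np.eye(n), mean_diff)

# Synthetic two-class data with a linear kernel, for illustration only.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(+1.0, 1.0, size=(30, 2)),
                    rng.normal(-1.0, 1.0, size=(30, 2))])
y_demo = np.concatenate([np.ones(30), -np.ones(30)])
K_demo = X_demo @ X_demo.T

alpha_demo = kfd_alpha(K_demo, y_demo)
# The projection of training example i is sum_j alpha_j k(x_j, x_i).
projections = K_demo @ alpha_demo
print(projections[y_demo > 0].mean(), projections[y_demo < 0].mean())
```

A new example $x$ would be projected as $\sum_i \alpha_i k(x_i, x)$, so prediction needs the kernel between training and test examples rather than the test features themselves. As with the linear case, a threshold on the projection still has to be chosen.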

In [None]:
# Solution goes here

## (4 points) Comparing performance for different kernels

Use half of the available data for training the model. The rest of the data is allocated to the test set. Repeat the experiment 10 times for different random splits of the data. Report the balanced accuracy for the test sets and plot a boxplot comparing the performance of the five kernel functions: the Gaussian kernel with width $\sigma \in \{0.23, 1.1, 8.7\}$ and the polynomial kernel with $c=1$ of order 2 and 3. Do not forget to label the graph appropriately.
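
The two ingredients of this protocol can be sketched as follows, assuming that balanced accuracy means the average of the true positive rate and the true negative rate, and that labels and predictions take values in $\{-1, +1\}$. The helper names and the toy labels are hypothetical.

```python
import numpy as np

def balanced_accuracy(labels_true, labels_pred):
    """Average of the true positive rate and the true negative rate."""
    pos = labels_true > 0
    neg = labels_true < 0
    true_pos_rate = np.mean(labels_pred[pos] > 0)
    true_neg_rate = np.mean(labels_pred[neg] < 0)
    return 0.5 * (true_pos_rate + true_neg_rate)

def random_half_split(num_examples, rng):
    """Return disjoint index arrays (train, test), each about half of the data."""
    perm = rng.permutation(num_examples)
    half = num_examples // 2
    return perm[:half], perm[half:]

# Toy labels, for illustration only.
y_true_demo = np.array([1, 1, 1, 1, -1, -1])
y_pred_demo = np.array([1, 1, 1, -1, -1, 1])
score_demo = balanced_accuracy(y_true_demo, y_pred_demo)
print(score_demo)

rng = np.random.default_rng(0)
train_idx, test_idx = random_half_split(11, rng)
print(sorted(train_idx), sorted(test_idx))
```

Unlike plain accuracy, this score is not inflated when one class dominates the test set. Drawing a fresh permutation for each of the 10 repetitions gives independent splits.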

If you were unable to solve the previous questions on kernelising Fisher's Discriminant and normalisation, report results using Linear Fisher's Discriminant on the 7 features above, and also the polynomial basis of degree 2 from Tutorial 2.

In [None]:
# Solution goes here