# Classification with Logistic Regression and Bayesian Linear Regression

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Assignment 1 (due: Monday, 18 April, 23:59)

Name: Suraj Narayanan Sasikumar

Student ID: u5881495

## Instructions

|             |Notes|
|:------------|:--|
|Maximum marks| 20|
|Weight|20% of final grade|
|Format| Complete this ipython notebook. Do not forget to fill in your name and student ID above|
|Submission mode| Use [wattle](https://wattle.anu.edu.au/)|
|Formulas| All formulas which you derive need to be explained unless you use very common mathematical facts. Picture yourself as explaining your arguments to somebody who is just learning about your assignment. In other words, do not assume that the person marking your assignment knows all the background and that you can therefore just write down the formulas without any explanation. It is your task to convince the reader that you know what you are doing when you derive an argument. Typeset all formulas in $\LaTeX$.|
| Code quality | Python code should be well structured, use meaningful identifiers for variables and subroutines, and provide sufficient comments. Please refer to the examples given in the tutorials. |
| Code efficiency | An efficient implementation of an algorithm uses fast subroutines provided by the language or additional libraries. For the purpose of implementing Machine Learning algorithms in this course, that means using the appropriate data structures provided by Python and in numpy/scipy (e.g. Linear Algebra and random generators). |
| Late penalty | For every day (starting at midnight) after the deadline of an assignment, the mark will be reduced by 5%. No assignment will be accepted more than 10 days after the deadline. |
| Cooperation | All assignments must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (please see the ANU policies on [Academic Honesty and Plagiarism](http://academichonesty.anu.edu.au)). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to broadly discuss ideas, approaches and techniques with a few other students, but not at a level of detail where specific solutions or implementation issues are described by anyone. If you choose to consult with other students, you must include the names of your discussion partners for each solution. If you have any questions on this, please ask the lecturer before you act. |
| Solution | To be presented in the tutorials. |

This assignment has two parts. In the first part, you apply logistic regression to given data (maximum 13 marks). In the second part, you answer a number of questions (maximum 7 marks). All formulas and calculations which are not part of Python code should be written using $\LaTeX$.

$\newcommand{\dotprod}[2]{\left\langle #1, #2 \right\rangle}$
$\newcommand{\onevec}{\mathbb{1}}$

Setting up the environment

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## The data set


The data set contains mass-spectrometric data which are used to distinguish between cancer and normal patterns (https://archive.ics.uci.edu/ml/datasets/Arcene). 

Please download the following data:
* training data https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.data,
* training labels https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_train.labels,
* validation data https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/ARCENE/arcene_valid.data, and
* validation labels https://archive.ics.uci.edu/ml/machine-learning-databases/arcene/arcene_valid.labels.

The following code reads the training and validation data into your workspace.

In [6]:
X_train = np.loadtxt("arcene_train.data")
y_train = np.loadtxt("arcene_train.labels")
X_val   = np.loadtxt("arcene_valid.data")
y_val   = np.loadtxt("arcene_valid.labels")

print(X_train.shape)
print(X_val.shape)

(100, 10000)
(100, 10000)


As the 10000-dimensional input space might lead to long computation times, we have prepared a subset of 220 features. Unless otherwise stated, you will from now on work only with this subset of the training and validation data.

In [4]:
feature_mask = np.array(
    [   4,   37,   85,  187,  233,  255,  375,  413,  435,  468,
      470,  477,  519,  528,  628,  629,  643,  661,  678,  682,
      750,  783,  786,  802,  936,  997, 1035, 1043, 1113, 1186,
     1288, 1294, 1316, 1319, 1441, 1475, 1488, 1546, 1577, 1589,
     1666, 1671, 1739, 1761, 1830, 1848, 1881, 1882, 1975, 2057,
     2116, 2157, 2170, 2297, 2308, 2406, 2407, 2460, 2532, 2619,
     2632, 2634, 2644, 2656, 2697, 2717, 2771, 2817, 2865, 2937,
     3006, 3033, 3109, 3250, 3256, 3364, 3369, 3386, 3517, 3574,
     3611, 3643, 3660, 3702, 3777, 3783, 3856, 4008, 4016, 4036,
     4058, 4081, 4147, 4157, 4182, 4197, 4202, 4230, 4251, 4305,
     4379, 4440, 4454, 4467, 4485, 4555, 4557, 4579, 4585, 4607,
     4685, 4702, 4721, 4730, 4894, 4899, 4954, 4959, 5004, 5048,
     5076, 5200, 5230, 5242, 5249, 5306, 5355, 5472, 5476, 5631,
     5720, 5773, 5790, 5936, 5994, 6106, 6111, 6162, 6163, 6192,
     6304, 6350, 6402, 6407, 6439, 6462, 6480, 6494, 6522, 6555,
     6596, 6620, 6678, 6773, 6791, 6869, 6888, 6889, 6904, 6927,
     6957, 6961, 7101, 7196, 7214, 7271, 7279, 7297, 7425, 7431,
     7436, 7462, 7505, 7512, 7627, 7651, 7747, 7793, 7812, 7855,
     7856, 7860, 7866, 7932, 7976, 7993, 8006, 8131, 8155, 8257,
     8266, 8270, 8367, 8378, 8440, 8472, 8501, 8726, 8761, 8829,
     8831, 8903, 9021, 9024, 9026, 9060, 9081, 9116, 9211, 9214,
     9233, 9319, 9371, 9506, 9539, 9549, 9603, 9616, 9633, 9703])
X_train_sub = X_train[:, feature_mask]
X_val_sub   = X_val[:, feature_mask]

print(X_train_sub.shape)
print(X_val_sub.shape)

(100, 220)
(100, 220)


## (2 points) Normalise the input data
Find a linear transformation of the training data resulting in a zero mean and unit variance. Report the parameters of the linear transformation for the first ten dimensions of the input data.

In general: 
* Under which circumstances does working with this transformed data lead to an advantage? 
* When is it counterproductive to normalise input data to zero mean and/or to unit variance?


In [34]:
# Per-feature mean and standard deviation of the training data (one value per column).
X_train_sub_mean = X_train_sub.mean(axis=0)
X_train_sub_sd = X_train_sub.std(axis=0)

print("Mean for first 10 dimension:", X_train_sub_mean[:10].tolist())
print("Standard Deviation for first 10 dimension:", X_train_sub_sd[:10].tolist())

def normalize(X):
    """Shift and scale X using the training-set mean and standard deviation."""
    return (X - X_train_sub_mean) / X_train_sub_sd

X_train_sub_norm = normalize(X_train_sub)
print(X_train_sub_norm)

Mean for first 10 dimension: [17.109999999999999, 2.4199999999999999, 19.399999999999999, 2.8199999999999998, 35.850000000000001, 24.829999999999998, 70.010000000000005, 84.939999999999998, 20.059999999999999, 16.98]
Standard Deviation for first 10 dimension: [26.215985581320417, 8.2694377076074517, 24.963172875257666, 9.0811673258452856, 42.418716387934232, 28.888078856164874, 75.096803527180839, 86.423934184923567, 25.67637824927807, 26.601872114571183]
[[-0.65265523 -0.29264384 -0.7771448  ..., -0.95926957  1.57272964
   1.67092457]
 [ 1.6360247  -0.29264384 -0.7771448  ...,  0.20705545  1.61340018
  -0.59328979]
 [-0.65265523 -0.29264384 -0.7771448  ..., -0.88064092  2.34546983
  -0.59328979]
 ..., 
 [-0.65265523 -0.29264384 -0.17625965 ..., -0.91995525 -0.50146771
  -0.59328979]
 [ 1.97932669 -0.29264384 -0.21631866 ..., -0.67096451  1.12535374
  -0.59328979]
 [-0.65265523 -0.29264384 -0.7771448  ...,  0.32499843  0.10859033
  -0.59328979]]
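
As a rough illustration of the transformation above, the following sketch standardises a synthetic matrix and then reuses the same training statistics on held-out data. The arrays `X_tr` and `X_new` and their sizes are made up for this example; they are not the Arcene data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for training and held-out inputs (hypothetical data).
X_tr = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_new = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Parameters of the linear transformation, estimated on the training data only.
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)

X_tr_norm = (X_tr - mean) / std
# Held-out data are transformed with the *training* statistics,
# so their mean and variance are only approximately 0 and 1.
X_new_norm = (X_new - mean) / std

print(X_tr_norm.mean(axis=0), X_tr_norm.std(axis=0))
print(X_new_norm.mean(axis=0), X_new_norm.std(axis=0))
```

Estimating the statistics on the training set and reusing them elsewhere keeps the model and its inputs consistent between training and evaluation.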


### Solution

## (1 point) Error function for logistic regression with quadratic regularisation
Define a Python function calculating the *cross-entropy* $E(\mathbf{w}) = - \ln p(\mathbf{t} \;|\; \mathbf{w})$ for logistic regression with two classes and quadratic (l2) regularisation for a given parameter vector $\mathbf{w}$.

In [None]:
# Solution goes here
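
One possible shape for such a function is sketched below; it is an illustration under stated assumptions, not the reference solution. It assumes targets coded as $t_n \in \{0, 1\}$ (labels in $\{-1, +1\}$ would need to be mapped first), a design matrix `X` with one row per sample, and a penalty of the form $\frac{\lambda}{2}\|\mathbf{w}\|^2$. The names `cross_entropy_error` and `lam` are hypothetical.

```python
import numpy as np

def cross_entropy_error(w, X, t, lam):
    """Regularised cross-entropy -ln p(t | w) + (lam / 2) * ||w||^2.

    Assumes t holds 0/1 targets and X has shape (n_samples, n_features).
    """
    a = X @ w  # activations a_n = w^T x_n
    # With y_n = sigmoid(a_n):
    #   -[t_n ln y_n + (1 - t_n) ln(1 - y_n)] = ln(1 + exp(a_n)) - t_n a_n.
    # np.logaddexp(0, a) evaluates ln(1 + exp(a)) without overflow.
    data_term = np.sum(np.logaddexp(0.0, a) - t * a)
    return data_term + 0.5 * lam * (w @ w)
```

Writing the data term through `np.logaddexp` avoids taking the logarithm of a sigmoid that has saturated to exactly 0 or 1.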

## (1 point) Gradient for logistic regression with quadratic regularisation
Define a Python function calculating the gradient of the above *cross-entropy* for logistic regression with two classes and quadratic (l2) regularisation.

In [None]:
# Solution goes here
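
A matching gradient could look like the sketch below, again assuming 0/1 targets and a $\frac{\lambda}{2}\|\mathbf{w}\|^2$ penalty, so that $\nabla E(\mathbf{w}) = \mathbf{X}^T(\mathbf{y} - \mathbf{t}) + \lambda \mathbf{w}$ with $y_n = \sigma(\mathbf{w}^T \mathbf{x}_n)$. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, written via tanh so that large |a| does not overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def cross_entropy_gradient(w, X, t, lam):
    """Gradient of the regularised cross-entropy: X^T (y - t) + lam * w.

    Assumes t holds 0/1 targets and X has shape (n_samples, n_features).
    """
    y = sigmoid(X @ w)  # predicted probabilities for class 1
    return X.T @ (y - t) + lam * w
```

Comparing this against a finite-difference approximation of the error function is a cheap way to catch sign or scaling mistakes before running an optimiser.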

## (1 point) Finding the optimal parameters for given regularisation
Using the error function and the gradient defined above, you now set up the optimisation to find the optimal parameter vector $\mathbf{w}^\star$ for the training data and a fixed regularisation constant $\lambda$.
Use the function *scipy.optimize.fmin_bfgs* as optimiser.

For each $\lambda = 10^k$, $k = -3, -2, \dots, 1$, report the first 10 components of the optimal parameter vector $\mathbf{w}^\star$ found.

In [5]:
# Solution goes here
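
The optimisation loop might be wired up roughly as follows. This is a sketch on a small synthetic problem, not on the Arcene data; `error`, `gradient`, `fit_logistic`, `X_toy` and `t_toy` are hypothetical names, and the targets are assumed to be coded as 0/1.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def error(w, X, t, lam):
    """Regularised cross-entropy for 0/1 targets."""
    a = X @ w
    return np.sum(np.logaddexp(0.0, a) - t * a) + 0.5 * lam * (w @ w)

def gradient(w, X, t, lam):
    """Gradient of the regularised cross-entropy."""
    y = 0.5 * (1.0 + np.tanh(0.5 * (X @ w)))  # logistic sigmoid
    return X.T @ (y - t) + lam * w

def fit_logistic(X, t, lam):
    """Minimise the error with BFGS, starting from w = 0."""
    w0 = np.zeros(X.shape[1])
    return fmin_bfgs(error, w0, fprime=gradient, args=(X, t, lam), disp=False)

# Synthetic toy problem: a bias column plus two random features.
rng = np.random.default_rng(0)
X_toy = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 2))])
t_toy = (X_toy[:, 1] + 0.5 * rng.normal(size=40) > 0).astype(float)

for k in range(-3, 2):
    lam = 10.0 ** k
    w_star = fit_logistic(X_toy, t_toy, lam)
    print(f"lambda = {lam:g}: w* = {w_star}")
```

Passing the data through `args` keeps the error and gradient functions free of global state, which makes it easy to reuse them on different data splits.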

## (2 points) Evaluating the solution with the validation data
So far, you have only used the training data. You now apply the learned model to the validation data and compare the prediction you get with the given validation labels. 

Report the performance measures
* the number of false positives (FP),
* the number of false negatives (FN),
* the number of true positives (TP),
* the number of true negatives (TN),
* the error rate,
* the specificity, 
* the sensitivity,

for the different settings of $\lambda = 10^k$, $k = -3, -2, \dots, 1$.

In [None]:
# Solution goes here
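
The measures listed above can be computed from predicted and true labels along the following lines. This is a minimal sketch: the function name `performance_measures` is hypothetical, the choice of which label counts as "positive" is an assumption to be checked against the data set description, and both classes are assumed to be present (otherwise sensitivity or specificity is undefined).

```python
import numpy as np

def performance_measures(t_true, t_pred, positive=1):
    """Confusion-matrix counts and derived rates for a two-class problem."""
    t_true = np.asarray(t_true)
    t_pred = np.asarray(t_pred)
    tp = int(np.sum((t_pred == positive) & (t_true == positive)))
    tn = int(np.sum((t_pred != positive) & (t_true != positive)))
    fp = int(np.sum((t_pred == positive) & (t_true != positive)))
    fn = int(np.sum((t_pred != positive) & (t_true == positive)))
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "error_rate": (fp + fn) / t_true.size,
        "sensitivity": tp / (tp + fn),  # fraction of positives detected
        "specificity": tn / (tn + fp),  # fraction of negatives detected
    }

measures = performance_measures([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
print(measures)
```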

## (3 points) Finding an optimal regularisation constant
We now consider the regularisation constant as a hyperparameter which we want to optimise.
For this task we use the training and validation data together.

Implement *s-fold cross-validation* with $s = 10$ to find an optimal regularisation constant which further reduces the error rates found in the previous question. Report the 
* optimal setting for the regularisation constant,
* the first 10 components of the optimal parameter $\mathbf{w}^\star$, and 
* the same performance measures as specified in the previous question which you achieved with those settings.

In [None]:
# Solution goes here
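
The mechanics of s-fold cross-validation can be sketched independently of the model. In the outline below, `s_fold_indices` and `cross_validated_error` are hypothetical helpers, and `fit` and `predict` stand for whatever training and prediction functions you have written; none of these names come from the assignment.

```python
import numpy as np

def s_fold_indices(n_samples, s=10, seed=0):
    """Yield (train_idx, held_out_idx) pairs for s-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), s)
    for i in range(s):
        train_idx = np.concatenate([folds[j] for j in range(s) if j != i])
        yield train_idx, folds[i]

def cross_validated_error(fit, predict, X, t, lam, s=10):
    """Average held-out error rate over s folds for one regularisation constant."""
    errors = []
    for train_idx, held_out_idx in s_fold_indices(len(t), s):
        w = fit(X[train_idx], t[train_idx], lam)
        errors.append(np.mean(predict(w, X[held_out_idx]) != t[held_out_idx]))
    return float(np.mean(errors))
```

With the training and validation data stacked into one array (for instance via `np.vstack` and `np.concatenate`), the candidate values of $\lambda$ can be compared by calling `cross_validated_error` once per value and keeping the one with the lowest average error.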

## (3 points) Feature selection from all 10000 features

In this task, you will use all 10000 features of the input data.

The goal is to find a subset of the 10000 features which improves the solution found so far.

An improvement is when at least one of
* the error, or
* the size of your chosen subset

decreases.

Please explain your approach and the reasons why you have chosen it.
Provide code to find the subset and report the number of features and the error you are able to achieve.

### Solution

In [6]:
# Solution goes here
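
As one illustrative starting point (not necessarily the best approach for this data), a simple univariate filter ranks each feature by the absolute Pearson correlation between that feature and the label. The helper name `rank_features_by_correlation` is hypothetical.

```python
import numpy as np

def rank_features_by_correlation(X, t):
    """Return feature indices sorted by decreasing |corr(feature, label)|."""
    X_centred = X - X.mean(axis=0)
    t_centred = t - t.mean()
    norms = np.sqrt((X_centred ** 2).sum(axis=0) * (t_centred ** 2).sum())
    # Constant features have zero norm and zero covariance; dividing by 1
    # instead leaves their score at exactly 0.
    safe_norms = np.where(norms > 0, norms, 1.0)
    correlation = (X_centred.T @ t_centred) / safe_norms
    return np.argsort(-np.abs(correlation))

# Tiny hand-made example: column 1 follows the label exactly, column 2 is constant.
t_demo = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
X_demo = np.column_stack([
    np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0]),
    3.0 * t_demo + 1.0,
    np.full(6, 7.0),
])
print(rank_features_by_correlation(X_demo, t_demo))
```

The top-ranked indices can then be used in place of `feature_mask`. If the ranking is computed on the same samples that are later used to estimate the error, that estimate will be optimistic, so the ranking is best repeated inside each cross-validation fold.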

## (2 points) Maximum likelihood and maximum a posteriori (MAP)
We assume data samples $X_n = \{ x_1,\dots,x_n \}$ were generated i.i.d. from a uniform distribution with unknown positive parameter $\theta$:
$$
   \mathcal{U}(x \;|\; 0, \theta) = 
\begin{cases}
 1/\theta & 0 \leq x \leq \theta \\
 0        & \textrm{otherwise}   \\
\end{cases}
$$

a) We now observe four data samples $ X_4 = \{ 6, 8, 9, 5\}$.
Calculate $\theta_{ML}$, the maximum likelihood estimate of $\theta$ for the observed data.

b) Calculate the posterior distribution of $\theta$ 
given that the data $ X_4 = \{ 6, 8, 9, 5 \}$ have been observed. As prior for $\theta$
use $p(\theta) = \mathcal{U}(\theta \;|\; 0, 10)$.

c) Calculate $\theta_{MAP}$, the maximum a posteriori estimate of $\theta$ given the data $ X_4 $ and the prior $p(\theta)$ as in the previous question.

Write down the calculations in $\LaTeX$.
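
Before writing out the derivation, it can help to evaluate the likelihood and the unnormalised posterior on a grid and compare the maximisers with your analytic result. The sketch below does this for the four observations; the grid spacing and the helper name `uniform_likelihood` are arbitrary choices made for this illustration.

```python
import numpy as np

samples = np.array([6.0, 8.0, 9.0, 5.0])

def uniform_likelihood(theta, x):
    """Likelihood prod_i U(x_i | 0, theta) for a single candidate theta."""
    if x.min() < 0 or x.max() > theta:
        return 0.0  # at least one sample falls outside [0, theta]
    return theta ** (-float(x.size))

thetas = np.arange(0.5, 15.5, 0.5)  # candidate values 0.5, 1.0, ..., 15.0
likelihood = np.array([uniform_likelihood(th, samples) for th in thetas])
prior = np.where(thetas <= 10.0, 0.1, 0.0)  # U(theta | 0, 10)
unnormalised_posterior = likelihood * prior

theta_ml_grid = thetas[np.argmax(likelihood)]
theta_map_grid = thetas[np.argmax(unnormalised_posterior)]
print("grid maximiser of the likelihood:", theta_ml_grid)
print("grid maximiser of the posterior: ", theta_map_grid)
```

A grid search only suggests where the maximum lies; the assignment still asks for the analytic argument.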

### Solution

## (1 point) Variance of a sum of random variables
Prove that the following holds for the variance of a sum of two random variables
$ X $ and $ Y $
$$
\operatorname{var}[X + Y] = \operatorname{var}[X] + \operatorname{var}[Y] + 2 \operatorname{cov}[X,Y],
$$
where $ \operatorname{cov}[X,Y] $ is the covariance between $X$ and $Y$.
  
For each step in your proof, provide a verbal explanation why this transformation step holds.
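
A quick numerical sanity check of the identity, which does not replace the proof, can be run on simulated data. The sketch below uses sample moments normalised by the sample size, for which the identity holds up to floating-point error; the particular distributions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)  # deliberately correlated with x

# Sample covariance with the same 1/N normalisation that np.var uses by default.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2.0 * cov_xy
print(lhs, rhs)
```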

### Solution

## (1 point) Matrix-vector identity proof
Given a nonsingular matrix $ \mathbf{A} $ and a vector $ \mathbf{v} $ of compatible
dimension, prove the following identity:
$$
 (\mathbf{A} + \mathbf{v} \mathbf{v}^T)^{-1} 
   = \mathbf{A}^{-1} - \frac{(\mathbf{A}^{-1} \mathbf{v}) (\mathbf{v}^T \mathbf{A}^{-1})}
                       {1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{v}}.
$$
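
The identity can be checked numerically before attempting the proof. The sketch below builds a random symmetric positive definite $\mathbf{A}$, an assumption made here so that $\mathbf{A}$ is nonsingular and the denominator $1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{v}$ is guaranteed to be nonzero; the dimension 4 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)  # symmetric positive definite, hence nonsingular
v = rng.normal(size=n)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(v, v))
rhs = A_inv - np.outer(A_inv @ v, v @ A_inv) / (1.0 + v @ A_inv @ v)

print(np.max(np.abs(lhs - rhs)))
```

In practice the right-hand side is the useful direction: once $\mathbf{A}^{-1}$ is known, the inverse after a rank-one update costs only matrix-vector products.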

### Solution

## (3 points) Change of variance
In Bayesian Linear Regression, the predictive distribution 
with a simplified prior 
  $ p(\mathbf{w}  \;|\;  \alpha) = \mathcal{N}(\mathbf{w} \;|\; \mathbf{0}, \alpha^{-1}\mathbf{I}) $
is a Gaussian distribution,
$$ 
p(t  \;|\;  \mathbf{x}, \mathbf{t}, \alpha, \beta) 
= \mathcal{N} (t \;|\; \mathbf{m}_N^T \boldsymbol{\mathsf{\phi}}(\mathbf{x}), \sigma_N^2(\mathbf{x})) 
$$
with variance
$$
  \sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\mathsf{\phi}}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\mathsf{\phi}}(\mathbf{x}).
$$

After using another training pair $ \left( \mathbf{x}_{N+1}, t_{N+1} \right) $ to adapt the model (that is, to learn from it),
the variance of the predictive distribution becomes

$$
  \sigma_{N+1}^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\mathsf{\phi}}(\mathbf{x})^T \mathbf{S}_{N+1} \boldsymbol{\mathsf{\phi}}(\mathbf{x}).
$$

a) Define the dimensions of the variables.

b) Prove that the uncertainties $ \sigma_N^2(\mathbf{x}) $ and
$ \sigma_{N+1}^2(\mathbf{x}) $ associated with the
predictive distributions satisfy

$$
  \sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x}).
$$
*Hint: Use the Matrix-vector identity proved in the previous question.*

c) Explain the meaning of this inequality.
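
The inequality can be observed numerically on random features. The sketch below assumes the standard posterior covariance $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$, which is not stated in the text above, and uses arbitrary values for $\alpha$, $\beta$ and the dimensions. The helper name `predictive_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
M, N = 5, 20  # number of basis functions, number of training points

Phi = rng.normal(size=(N, M))   # design matrix, one feature vector per row
phi_new = rng.normal(size=M)    # features of the additional training input
phi_query = rng.normal(size=M)  # features of the input where we predict

def predictive_variance(Phi, phi, alpha, beta):
    """1/beta + phi^T S phi with S^{-1} = alpha I + beta Phi^T Phi (assumed form)."""
    S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    return 1.0 / beta + phi @ np.linalg.solve(S_inv, phi)

var_N = predictive_variance(Phi, phi_query, alpha, beta)
var_N_plus_1 = predictive_variance(np.vstack([Phi, phi_new]), phi_query, alpha, beta)
print(var_N, var_N_plus_1)
```

Under this assumed form, adding a data point adds the rank-one term $\beta \boldsymbol{\mathsf{\phi}}(\mathbf{x}_{N+1}) \boldsymbol{\mathsf{\phi}}(\mathbf{x}_{N+1})^T$ to $\mathbf{S}_N^{-1}$, which is where the matrix-vector identity from the previous question enters.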



### Solution