# Lagrange multipliers: PCA

###### COMP4670/8600 - Statistical Machine Learning - Tutorial

In this lab we will apply Lagrange multipliers to derive and implement principal component analysis (PCA).

### Assumed knowledge:

- Optimisation in Python (lab)
- PCA (IML and lecture)

### After this lab, you should be comfortable with:

- Applying Lagrange multipliers to optimisation problems
- Implementing solutions with Lagrange multipliers

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt

%matplotlib inline

## Data loading

For this lab we will use a spam email dataset. This dataset contains 57 features, which are word and character frequencies. The last column of the dataset is the label, which indicates whether or not an email is spam.

Load and shuffle the data. Currently, each label is an element of {0, 1}. Convert the labels into {-1, 1} (in preparation for the SVM). Then normalise each feature to have zero mean and unit variance.

In [2]:
# Load the data into features and labels.
data = np.genfromtxt('spambase.csv', delimiter=',', skip_header=1)
indices = np.arange(len(data))
np.random.shuffle(indices)
features = np.nan_to_num(data[indices, :-1])
# Convert binary labels into {-1, 1} (for the SVM later in the lab).
labels = data[indices, -1].astype(bool) * 2 - 1

# Normalise the data.
features -= features.mean(axis=0, keepdims=True)
features /= features.std(axis=0, keepdims=True)

N = len(features)

features.shape

(4601, 57)
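
As a quick sanity check on the normalisation step, the sketch below applies the same two in-place operations to a small synthetic array and confirms that each column ends up with zero mean and unit standard deviation. The array `toy` and its shape are made up for illustration and are not part of the lab data. Note that a constant column would have a standard deviation of zero and cause a division by zero, so this check assumes every feature varies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the feature matrix: 100 rows, 4 columns.
toy = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

# Same normalisation as in the cell above.
toy -= toy.mean(axis=0, keepdims=True)
toy /= toy.std(axis=0, keepdims=True)

# Per-column statistics after normalisation.
col_means = toy.mean(axis=0)
col_stds = toy.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```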

## Lagrange multipliers

One way to solve optimisation problems with equality constraints is to use Lagrange multipliers. We can write such optimisation problems in the following form:

\begin{align*}
    \underset{x}{\mathrm{maximise}}&\ f(x)\\
    \mathrm{such\ that}&\ h(x) = 0.
\end{align*}

We have an *objective function* $f : \mathbb{R}^n \to \mathbb{R}$, a set of (vectorised) *equality constraints* $h : \mathbb{R}^n \to \mathbb{R}^m$, and an *optimisation parameter* $x \in \mathbb{R}^n$.

*Lagrange multipliers* are additional parameters $\lambda \in \mathbb{R}^m$, which we introduce by defining the *Lagrange function* $\mathcal L$ of this problem. The Lagrange function is

$$
    \mathcal L(x, \lambda) = f(x) + \lambda \cdot h(x).
$$

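Note that for *equality* constraints the sign is a matter of convention: the "+" can be replaced with "-", which only flips the sign of $\lambda$. For *inequality* constraints the sign is no longer arbitrary, because the multiplier is then restricted to be non-negative. Under the common convention of *minimising* $f$ subject to $g(x) \le 0$, a "+" must be used; for the maximisation form above with $g(x) \le 0$, the corresponding term is subtracted instead.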

Once we have a Lagrange function, we then need to find its stationary points, i.e. the points $(x^*, \lambda^*)$ where $\nabla_{x, \lambda} \mathcal L(x^*, \lambda^*) = 0$.

*Also note that stationarity of $(x^*, \lambda^*)$ is only a necessary condition for optimality. When minimising a convex $f(x)$ (or maximising a concave one) subject to affine $h(x)$, stationarity is also sufficient for optimality. That is why in A1 we have to prove the convexity first.*
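
To make the recipe concrete, here is a minimal sketch on a toy problem that is unrelated to PCA: maximise $f(x) = x_1 + x_2$ subject to $h(x) = x_1^2 + x_2^2 - 1 = 0$. The Lagrange function is $\mathcal L(x, \lambda) = x_1 + x_2 + \lambda (x_1^2 + x_2^2 - 1)$, and we look for a root of its gradient numerically with `scipy.optimize.root`. The toy problem, the helper name `grad_lagrangian` and the starting point are all assumptions made for illustration.

```python
import numpy as np
import scipy.optimize as opt


def grad_lagrangian(z):
    """Gradient of L(x, lam) = x1 + x2 + lam * (x1^2 + x2^2 - 1)."""
    x1, x2, lam = z
    return np.array([
        1 + 2 * lam * x1,       # dL/dx1
        1 + 2 * lam * x2,       # dL/dx2
        x1**2 + x2**2 - 1,      # dL/dlam, i.e. the constraint h(x)
    ])


# Start near the expected maximiser; a different start may converge
# to the other stationary point, which is the constrained minimum.
sol = opt.root(grad_lagrangian, x0=[1.0, 1.0, -1.0])
x_star, lam_star = sol.x[:2], sol.x[2]
print(x_star, lam_star)
```

Solving the same equations by hand gives $x^* = (1/\sqrt 2, 1/\sqrt 2)$ with $\lambda^* = -1/\sqrt 2$, and a second stationary point at $x = (-1/\sqrt 2, -1/\sqrt 2)$ with $\lambda = 1/\sqrt 2$. Comparing $f$ at the two points shows which one is the maximum, which illustrates why stationarity alone is only a necessary condition.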

## 1. Principal component analysis

In PCA, we have $D$-dimensional observations $x_1, \dots, x_N \in \mathbb{R}^D$. We want to find the 1-dimensional subspace along which the projected data has maximum variance. We can represent this subspace by a unit vector $u \in \mathbb R^D$ that spans it.

We can also write the set of observations in matrix form: $X \in \mathbb R^{N \times D}$.

Each observation is projected onto the subspace: for an observation $x$, the projected observation is $u^Tx$. The variance of the projected data is $u^TSu$ with covariance matrix $S = \frac{1}{N} \sum_{i = 1}^N \left(x_i - \bar x\right) \left(x_i - \bar x\right)^T$ where $\bar x$ is the mean observation.
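
The identity between the variance of the projected data and $u^TSu$ can be checked numerically. The sketch below uses a small synthetic dataset and a random unit vector; the data, the mixing matrix and the variable names are assumptions for illustration only. Note that `np.var` uses the $\frac{1}{N}$ normalisation by default, which matches the definition of $S$ above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: 500 observations in 3 dimensions.
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

# Covariance matrix with the 1/N normalisation used in the text.
X_centred = X - X.mean(axis=0)
S = X_centred.T @ X_centred / len(X)

# A random direction, rescaled to unit length.
u = rng.normal(size=3)
u /= np.linalg.norm(u)

# Variance of the projections u^T x_i versus the quadratic form u^T S u.
projected_var = (X @ u).var()
quadratic_form = u @ S @ u
print(projected_var, quadratic_form)
```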

We want to maximise the variance of observations projected onto $u$.

(If necessary, refer to the week 6 lab for more detail.)

### 1a.

Write down the optimisation problem in the form shown above, clearly showing the objective function and equality constraints.

### <span style="color:blue">Answer</span>
<i>--- replace this with your solution, add and remove code and markdown cells as appropriate ---</i>

### 1b.

Write down the Lagrange function for this problem.

### <span style="color:blue">Answer</span>
<i>--- replace this with your solution, add and remove code and markdown cells as appropriate ---</i>

### 1c.

By differentiating the Lagrange function with respect to $u$ and $\lambda$ and equating the derivatives to zero, show that the stationary points are unit eigenvectors of $S$.

### <span style="color:blue">Answer</span>
<i>--- replace this with your solution, add and remove code and markdown cells as appropriate ---</i>

### 1d.

Find the stationary point that maximises the variance.

### <span style="color:blue">Answer</span>
<i>--- replace this with your solution, add and remove code and markdown cells as appropriate ---</i>

### 1e.

In general, we can have a $k$-dimensional subspace. The $k$ principal components are the $k$ eigenvectors associated with the $k$ largest eigenvalues.
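
As a rough sketch of how the top-$k$ eigenvectors can be selected in NumPy, the example below works on a small synthetic array rather than the spam data. It assumes `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so the order has to be reversed to get the largest ones first. The data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in 4 dimensions with unequal spread per axis.
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 0.5, 0.1])
X_centred = X - X.mean(axis=0)
S = X_centred.T @ X_centred / len(X)

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(S)

# Indices of the k largest eigenvalues, largest first.
k = 2
order = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, order]    # shape (D, k), one eigenvector per column

# Project the centred data onto the k components.
Z = X_centred @ components        # shape (N, k)
print(eigvals[order], Z.var(axis=0))
```

The variance of each projected coordinate should match the corresponding eigenvalue, which is a useful check that the eigenvectors were picked in the right order.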

Using `np.linalg`, find the first two principal components of the spam email features, project the data onto these components, then make a scatter plot of the projected data with `matplotlib`. Colour the points in the plot by their label.

In [None]:
# replace this with your solution, add and remove code and markdown cells as appropriate