$$ \LaTeX \text{ command declarations here.}
\newcommand{\R}{\mathbb{R}}
\renewcommand{\vec}[1]{\mathbf{#1}}
$$

# EECS 545:  Machine Learning
## Lecture 19:  Unsupervised Learning: PCA and ICA
* Instructor:  **Jacob Abernethy**
* Date:  March 28, 2016


*Lecture Exposition*: Saket


## References

This lecture draws from the following resources:



## Classical PCA: Statement of the theorem

Suppose we want to find an orthogonal set of $L$ linear basis vectors $w_j \in \R^D$, and the corresponding scores $z_i \in \R^L$, such that we minimize the average **reconstruction error**
$$ J(W,Z) = \frac{1}{N}\sum_{i=1}^{N} \| x_i - Wz_i \|^2 $$

subject to the constraint that $W$, the $D \times L$ matrix whose columns are the $w_j$, has orthonormal columns.

The optimal solution is obtained by setting $\hat W = V_L$, where $V_L$ contains the $L$ eigenvectors with the largest eigenvalues of the empirical covariance matrix, $\hat \Sigma = \frac{1}{N} \sum_{i=1}^{N}(x_i - \bar x) (x_i-\bar x)^T$. Assuming the data have been centered, the corresponding optimal scores are $\hat z_i = \hat W^T x_i$, the orthogonal projection of the data onto the span of these eigenvectors.

(Proof: Murphy, Section 12.2.2)

<img src="m_1.png" align="middle">
An illustration of PCA where $D = 2$ and $L = 1$. Circles are the original data points, crosses are the reconstructions. The red dot is the data mean. The points are orthogonally projected onto the line.

The diagonal line is the vector $w_1$; this is called the first principal component or principal direction. The data points $x_i \in \R^2$ are orthogonally projected onto this line to get $z_i \in \R$. This is the best 1-dimensional approximation to the data.
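The theorem above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the function name `pca_eig` and the synthetic data are assumptions made for illustration, and the data are centered before projecting. It builds $\hat \Sigma$, takes the top-$L$ eigenvectors as $\hat W$, and computes the average reconstruction error $J$, which should equal the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_eig(X, L):
    """Classical PCA via eigendecomposition; X is (N, D), one point per row."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                  # centered data
    Sigma_hat = Xc.T @ Xc / X.shape[0]              # empirical covariance, (D, D)
    evals, evecs = np.linalg.eigh(Sigma_hat)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]                 # re-sort to descending order
    evals, evecs = evals[order], evecs[:, order]
    W = evecs[:, :L]                                # V_L: top-L eigenvectors, (D, L)
    Z = Xc @ W                                      # scores z_i = W^T (x_i - x_bar)
    X_hat = x_bar + Z @ W.T                         # reconstructions
    J = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # average reconstruction error
    return W, Z, X_hat, J, evals

# Hypothetical correlated 2-D data, reduced to L = 1 as in the figure.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
W, Z, X_hat, J, evals = pca_eig(X, L=1)
print(J, evals[1:].sum())   # the two values should agree up to round-off
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because $\hat \Sigma$ is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.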

## EM algorithm for PCA

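Let $\tilde Z$ be the $L \times N$ matrix storing the posterior means (low-dimensional representations) along its columns.

Similarly, let $\tilde X = X^T$ store the original data along its columns. For $\sigma^2 = 0$, where $\sigma^2$ is the observation noise variance of the probabilistic PCA model: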

**E-step**:
Notice that this is just an orthogonal projection of the data.
$$ \tilde Z = (W^TW)^{-1} W^T \tilde X $$

**M-step**:
Here we exploit the fact that $\Sigma = \mathrm{cov}[z_i \mid x_i, \theta] = 0I$ when $\sigma^2 = 0$, so that $\mathbb{E}[z_i z_i^T] = \mathbb{E}[z_i]\mathbb{E}[z_i]^T$.
$$ \hat W = \left[\sum_i x_i \mathbb{E}[z_i]^T\right]\left[\sum_i \mathbb{E}[z_i] \mathbb{E}[z_i]^T\right]^{-1}$$
In matrix form:
$$ \hat W = \tilde X \tilde Z^T (\tilde Z \tilde Z^T)^{-1}$$
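The two updates translate almost line for line into NumPy. The sketch below is illustrative only: the function name `em_pca`, the random initialization, and the fixed iteration count are assumptions, and the data are centered first. It uses `np.linalg.solve` instead of forming explicit inverses.

```python
import numpy as np

def em_pca(X, L, n_iter=100, seed=0):
    """EM for PCA in the sigma^2 = 0 limit; X is (N, D), one point per row."""
    rng = np.random.default_rng(seed)
    X_tilde = (X - X.mean(axis=0)).T          # (D, N): centered data along columns
    D, N = X_tilde.shape
    W = rng.standard_normal((D, L))           # random initial guess for W
    for _ in range(n_iter):
        # E-step: Z = (W^T W)^{-1} W^T X, an orthogonal projection onto span(W)
        Z = np.linalg.solve(W.T @ W, W.T @ X_tilde)        # (L, N)
        # M-step: W = X Z^T (Z Z^T)^{-1}, computed via the transposed system
        W = np.linalg.solve(Z @ Z.T, Z @ X_tilde.T).T      # (D, L)
    # Final E-step so that the returned Z matches the returned W.
    Z = np.linalg.solve(W.T @ W, W.T @ X_tilde)
    return W, Z
```

The columns of the returned `W` span the principal subspace but are generally not orthonormal or ordered by variance; a QR decomposition of `W` gives an orthonormal basis for the same subspace if one is needed.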

## EM for PCA

Example code: https://github.com/sbailey/empca/blob/master/empca.py

<img src="m_2.png" align="middle">

Illustration of EM for PCA when $D = 2$ and $L = 1$. Green stars are the original data points, black circles are their reconstructions. The weight vector $w$ is represented by the blue line. (a) We start with a random initial guess of $w$. The E step is represented by the orthogonal projections. (b) We update the rod $w$ in the M step, keeping the projections onto the rod (black circles) fixed. (c) Another E step. The black circles can 'slide' along the rod, but the rod stays fixed. (d) Another M step.

## Advantages of EM over eigenvector methods

* EM can be faster. In particular, assuming $N$, $D \gg L$, the dominant cost of EM is the projection operation in the E step, so the overall time is $O(T L N D)$, where $T$ is the number of iterations.

* EM can be implemented in an online fashion, i.e., we can update our estimate of $W$ as the data streams in.

* EM can handle missing data in a simple way.

* EM can be extended to handle mixtures of PPCA/FA models.

* EM can be modified to variational EM or to variational Bayes EM to fit more complex models.

## ICA

* Consider the following situation. You are in a crowded room and many people are speaking. Your ears essentially act as two microphones, which are listening to a linear combination of the different speech signals in the room. Your goal is to deconvolve the mixed signals into their constituent parts. This is known as the cocktail party problem, and is an example of blind signal separation (BSS), or blind source separation, where “blind” means we know “nothing” about the source of the signals. 


* Applications:
    * acoustic signal processing
    * analysing EEG and MEG signals
    * financial data
    * any other dataset (not necessarily temporal) where latent sources or factors get mixed together in a linear way.
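To make the linear mixing concrete, here is a small synthetic sketch. The two source signals and the mixing matrix `A` are hypothetical choices for illustration, not taken from the lecture. Each observed channel (a "microphone") is a fixed linear combination of the latent sources.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
t = np.linspace(0, 8, N)

# Two hypothetical non-Gaussian source signals (the "speakers").
s1 = np.sign(np.sin(3 * t))          # square wave
s2 = rng.laplace(size=N)             # heavy-tailed noise
S = np.column_stack([s1, s2])        # (N, 2) latent sources

# Hypothetical mixing matrix: each row holds one microphone's weights.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T                          # (N, 2) observed mixtures

# If A were known, unmixing would be a plain linear solve.
# ICA has to estimate the unmixing from X alone.
S_rec = np.linalg.solve(A, X.T).T
```

In the blind setting only `X` is available, so the sources can at best be recovered up to permutation and scaling.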

## FastICA

* An approximate Newton method for fitting ICA models.

* Assume all source distributions are known and are the same, so we can just write $G(z) = -\log p(z)$. Let $g(z) = \frac{d}{dz} G(z)$. The constrained objective, and its gradient and Hessian, are given by:

$$\begin{align}
 & f(v) = \mathbb{E}[G(v^Tx)] + \lambda (1-v^T v) \\
 & \nabla f(v) = \mathbb{E}[xg(v^Tx)] -\beta v \\
 & H(v) = \mathbb{E} [xx^T g'(v^T x)]- \beta I 
\end{align}
$$

where $\beta = 2\lambda$ and $\lambda$ is the Lagrange multiplier.

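* Let us make the following approximation, where the last equality assumes the data have been whitened so that $\mathbb{E}[xx^T] = I$:

$$ \mathbb{E} [xx^T g'(v^T x)]  \approx \mathbb{E}[xx^T] \, \mathbb{E}[g'(v^T x)] = \mathbb{E}[g'(v^T x)] $$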

* This makes the Hessian very easy to invert, giving rise to the following Newton update:

$$ v^* := v - \frac {\mathbb{E}[xg(v^Tx)] - \beta v}{\mathbb{E}[g'(v^Tx)] - \beta} $$

* After each update, project back onto the constraint surface $\|v\| = 1$ by normalizing:

$$ v^{new} := \frac{v^*}{\| v^* \|}$$
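The update can be sketched in a few lines of NumPy. This is a minimal one-unit illustration under stated assumptions, not a full FastICA implementation. The function names `whiten` and `fastica_one_unit` are hypothetical. The contrast $G(z) = \log\cosh(z)$, so that $g = \tanh$ and $g' = 1 - \tanh^2$, is a common choice assumed here rather than one given in the lecture. Expectations are replaced by sample averages. Multiplying the Newton update by the scalar $\beta - \mathbb{E}[g'(v^Tx)]$ gives $\mathbb{E}[xg(v^Tx)] - \mathbb{E}[g'(v^Tx)]\,v$, which points along the same line as $v^*$, so after normalization $\beta$ never has to be computed.

```python
import numpy as np

def whiten(X):
    """Center X (N, D) and rescale it so that the sample E[x x^T] is I."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])
    return Xc @ evecs / np.sqrt(evals)       # assumes a full-rank covariance

def fastica_one_unit(X_white, n_iter=200, tol=1e-8, seed=0):
    """Estimate one unmixing direction v from whitened data (N, D)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X_white.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = X_white @ v                       # projections v^T x_i, shape (N,)
        g = np.tanh(u)                        # g(u) for G(u) = log cosh(u)
        g_prime = 1.0 - g ** 2                # g'(u)
        # Rescaled Newton step: E[x g(v^T x)] - E[g'(v^T x)] v
        v_star = (X_white * g[:, None]).mean(axis=0) - g_prime.mean() * v
        v_new = v_star / np.linalg.norm(v_star)   # project back to ||v|| = 1
        # v and -v describe the same component, so compare up to sign.
        converged = abs(abs(v_new @ v) - 1.0) < tol
        v = v_new
        if converged:
            break
    return v
```

Given whitened mixtures `X_white`, the recovered component is `X_white @ v`. Extracting several components would additionally require keeping the directions orthogonal to one another, which this sketch does not do.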