In [0]:
import numpy as np
from sklearn.decomposition import PCA 


In [0]:
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# PCA using Scikit-Learn

In [0]:
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

In [14]:
X2D[:5]

array([[ 1.26203346,  0.42067648],
       [-0.08001485, -0.35272239],
       [ 1.17545763,  0.36085729],
       [ 0.89305601, -0.30862856],
       [ 0.73016287, -0.25404049]])

In [23]:
X.shape

(60, 3)

It is trivial to do PCA using Scikit-Learn. But what is the algorithm behind the result? It can be explained by the singular value decomposition.

# PCA using NumPy with SVD (singular value decomposition)

In [0]:
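X_centered = X - X.mean(axis=0)  # subtract the mean of each column
U, s, V = np.linalg.svd(X_centered)  # note: V is returned already transposed (principal axes are its rows)
W2 = V.T[:, :2]  # first two principal axes as columns (renamed so it does not overwrite the weight w2)

In [0]:
X2D_svd = X_centered.dot(W2)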

In [18]:
X2D_svd[:5]

array([[-1.26203346, -0.42067648],
       [ 0.08001485,  0.35272239],
       [-1.17545763, -0.36085729],
       [-0.89305601,  0.30862856],
       [-0.73016287,  0.25404049]])

So we get the same result, except with the opposite sign. The sign of a principal axis is arbitrary, so both projections are equally valid. But this raises more questions. Why should the data be centered? Why can SVD implement PCA?
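The sign flip can be checked numerically. The following is a minimal sketch using NumPy only; it rebuilds the same toy data so that it runs on its own, and the names `Vt`, `W2`, `X2D_flipped`, `same_variance` and `same_reconstruction` are chosen here for illustration (`Vt` is the array called `V` in the cell above).

```python
import numpy as np

# Rebuild the toy data set from the first cells so this block is self-contained.
np.random.seed(4)
m = 60
noise = 0.1
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * 0.1 + X[:, 1] * 0.3 + noise * np.random.randn(m)

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
W2 = Vt[:2].T                      # first two principal axes as columns

X2D_svd = X_centered @ W2          # projection with the axes as returned
X2D_flipped = X_centered @ (-W2)   # projection with both axes negated

# The variance captured along each axis does not depend on the sign.
same_variance = np.allclose(X2D_svd.var(axis=0), X2D_flipped.var(axis=0))

# Mapping back to 3D gives the same points either way.
same_reconstruction = np.allclose(X2D_svd @ W2.T, X2D_flipped @ (-W2).T)

print(same_variance, same_reconstruction)
```

Both checks should print `True`, which is why a result that differs from Scikit-Learn only by sign still describes the same projection.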

# Mathematics Behind PCA with SVD

PCA finds the axes that account for the largest amount of variance in the training set. In the above example, that means finding such axes for the set X. The variance of X along its first axis (column) is var(X[:,0]).
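Variance is measured around the mean, which hints at why the data is centered first. A small sketch, assuming a random stand-in for X with non-zero column means (the data and the names `var_col0`, `from_centered` and `from_raw` are illustrative, not part of the notebook above):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 60
# Stand-in data: 3 columns with different means and spreads (an assumption for this demo).
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[2.0, 1.0, 0.3], size=(m, 3))
X_centered = X - X.mean(axis=0)

var_col0 = np.var(X[:, 0])                         # NumPy normalizes by 1/m by default
from_centered = np.sum(X_centered[:, 0] ** 2) / m  # mean square of the centered column
from_raw = np.sum(X[:, 0] ** 2) / m                # mean square of the raw column

# Only the centered column reproduces the variance;
# the raw column is off by the squared mean.
print(np.isclose(var_col0, from_centered))
print(np.isclose(from_raw, var_col0 + X[:, 0].mean() ** 2))
```

Without centering, sums of squares mix the spread of the data with the position of its mean, so the directions found by SVD would no longer be directions of largest variance.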

The sum of the variances along all axes (columns) is, up to a constant normalization factor (1/m, or 1/(m-1) for the unbiased estimate):

    var_total = np.trace(X_centered.T @ X_centered) = np.trace(X_centered @ X_centered.T).

Denoting the eigenvalues of (X_centered.T @ X_centered) as $\lambda_i$, we also have

var_total = $\sum_i\lambda_i$
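These identities are easy to verify numerically. The sketch below uses a random stand-in for X (an assumption made to keep the block self-contained); `G`, `trace_small`, `trace_big`, `eigvals` and `total_var` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 60
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[2.0, 1.0, 0.3], size=(m, 3))  # stand-in data
X_centered = X - X.mean(axis=0)

G = X_centered.T @ X_centered                     # 3 x 3 matrix
trace_small = np.trace(G)
trace_big = np.trace(X_centered @ X_centered.T)   # m x m matrix, same trace
eigvals = np.linalg.eigvalsh(G)                   # G is symmetric, so eigvalsh applies
total_var = m * np.var(X, axis=0).sum()           # undo NumPy's 1/m normalization

print(np.isclose(trace_small, trace_big))
print(np.isclose(trace_small, eigvals.sum()))
print(np.isclose(trace_small, total_var))
```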

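And the variance along an arbitrary axis, given by a unit vector $u$, can be expressed (with the same scaling) as

var = $\sum_i\alpha_i\lambda_i$ 

under the conditions 

$\alpha_i \ge 0$ and $\sum_i\alpha_i=1$,

where $\alpha_i$ is the squared projection of $u$ onto the $i$-th eigenvector.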

So the largest variance is achieved along the eigenvector with the largest eigenvalue.
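The weighted-sum argument can be illustrated with a random direction. This is a sketch under the assumption that the weights are the squared projections of a unit vector onto the eigenvectors; the stand-in data and the names `u`, `alpha`, `var_along_u` and `top` are introduced here for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 60
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[2.0, 1.0, 0.3], size=(m, 3))  # stand-in data
X_centered = X - X.mean(axis=0)
G = X_centered.T @ X_centered

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(G)

# A random unit vector playing the role of an arbitrary axis.
u = rng.normal(size=3)
u /= np.linalg.norm(u)

alpha = (eigvecs.T @ u) ** 2                # squared projections onto the eigenvectors
var_along_u = np.sum((X_centered @ u) ** 2) # unnormalized variance along u

print(np.isclose(alpha.sum(), 1.0))              # weights sum to one
print(np.isclose(var_along_u, alpha @ eigvals))  # variance is the weighted sum
print(var_along_u <= eigvals[-1] + 1e-9)         # never above the largest eigenvalue

# The bound is reached along the eigenvector of the largest eigenvalue.
top = eigvecs[:, -1]
print(np.isclose(np.sum((X_centered @ top) ** 2), eigvals[-1]))
```

Because the weights are non-negative and sum to one, the variance is a weighted average of the eigenvalues, and a weighted average cannot exceed its largest term.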

According to the documentation of np.linalg.svd, the vector "s" returned by np.linalg.svd(X_centered) holds the singular values of X_centered in descending order. Their squares, s**2, are the eigenvalues of (X_centered.T @ X_centered), which are also the nonzero eigenvalues of (X_centered @ X_centered.T). The rows of "V" are the corresponding eigenvectors of (X_centered.T @ X_centered). So 

X_centered.dot(V.T[:, :2]) 

is the projection onto the two axes with the largest variances, which is exactly what we want from PCA!
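The link between the SVD and the eigendecomposition can be confirmed directly. A minimal sketch, again on a random stand-in for X; `Vt` is the array called `V` above, and `eigvals_desc` and `ratio` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 60
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[2.0, 1.0, 0.3], size=(m, 3))  # stand-in data
X_centered = X - X.mean(axis=0)
G = X_centered.T @ X_centered

U, s, Vt = np.linalg.svd(X_centered)
eigvals_desc = np.linalg.eigvalsh(G)[::-1]   # reverse to descending order

# Squared singular values are the eigenvalues of X_centered.T @ X_centered.
print(np.allclose(s ** 2, eigvals_desc))

# Each row of Vt is an eigenvector: G v_i = s_i^2 v_i (checked column-wise on Vt.T).
print(np.allclose(G @ Vt.T, Vt.T * s ** 2))

# Share of the total variance carried by each principal axis.
ratio = s ** 2 / np.sum(s ** 2)
print(ratio)
```

The `ratio` vector plays the same role as the explained variance ratio reported by PCA implementations: it shows how much of the total variance the first two axes retain.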